Abstract

Random forests are ensemble methods which grow trees as base learners
and combine their predictions by averaging. Random forests are known
for their good practical performance, particularly in high-dimensional
settings. On the theoretical side, several studies highlight the potentially
fruitful connection between random forests and kernel methods. In this
paper, we work out this connection in full detail. In particular, we show
that by slightly modifying their definition, random forests can be rewritten
as kernel methods (called KeRF for Kernel based on Random Forests)
which are more interpretable and easier to analyze. Explicit expressions of
KeRF estimates for some specific random forest models are given, together
with upper bounds on their rate of consistency. We also show empirically
that KeRF estimates compare favourably to random forest estimates.

Index Terms: Random forests, randomization, consistency, rate of
consistency, kernel methods.

2010 Mathematics Subject Classification : 62G05, 62G20. 


1 Introduction 

Random forests are a class of learning algorithms used to solve pattern
recognition problems. As ensemble methods, they grow many trees as base
learners and aggregate them to predict. Growing many different trees from a
single data set requires randomizing the tree building process by, for
example, sampling the data set. Thus, there exists a variety of random forests,
depending on how trees are built and how the randomness is introduced in the
tree building process.

One of the most popular random forests is that of Breiman (2001), which grows
trees based on the CART procedure (Classification and Regression Trees, Breiman
et al., 1984) and randomizes both the training set and the splitting directions.
Breiman’s (2001) random forests have been under active investigation during
the last decade, mainly because of their good practical performance and their
ability to handle high-dimensional data sets. Moreover, they are easy to run
since they only depend on a few parameters which are easily tunable (Liaw and
Wiener, 2002; Genuer et al., 2008). They are acknowledged to be state-of-the-art
methods in fields such as genomics (Qi, 2012) and pattern recognition (Rogez
et al., 2008), just to name a few.

However, even if random forests are known to perform well in many contexts,
little is known about their mathematical properties. Indeed, most authors study
forests whose construction does not depend on the data set. Although
consistency of such simplified models has been addressed in the literature
(e.g., Biau et al., 2008; Ishwaran and Kogalur, 2010; Denil et al., 2013), these
results do not adapt to Breiman’s forests, whose construction strongly depends
on the whole training set. The latest attempts to study the original algorithm
are by Mentch and Hooker (2014) and Wager (2014), who prove its asymptotic
normality, or by Scornet et al. (2014), who prove its consistency under
appropriate assumptions.

Despite these works, several properties of random forests still remain
unexplained. A promising way for understanding their complex mechanisms is to
study the connection between forests and kernel estimates, that is, estimates
$m_n$ which take the form

$$ m_n(x) = \frac{\sum_{i=1}^n Y_i K_k(X_i, x)}{\sum_{i=1}^n K_k(X_i, x)}, \tag{1} $$

where $\{(X_i, Y_i) : 1 \le i \le n\}$ is the training set, $(K_k)_k$ is a
sequence of kernel functions, and $k$ ($k \in \mathbb{N}$) is a parameter to be
tuned. Unlike the most used Nadaraya-Watson kernels (Nadaraya, 1964; Watson,
1964), which satisfy a homogeneous property of the form
$K_h(X_i, x) = K((x - X_i)/h)$, kernels $K_k$ are not necessarily of this form.
Therefore, the analysis of kernel estimates defined by (1) turns out to be more
complicated and cannot be based on general results regarding Nadaraya-Watson
kernels.
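As a minimal sketch of an estimate of the form (1), the Python function below computes the kernel-weighted average for an arbitrary kernel supplied by the caller. The names `kernel_estimate` and `box_kernel`, the bandwidth `h`, and the choice to return 0 when every kernel value vanishes are illustrative assumptions, not part of the paper.

```python
def kernel_estimate(x, xs, ys, kernel):
    """Estimate of the form (1): sum_i Y_i K(X_i, x) / sum_i K(X_i, x)."""
    num = sum(y * kernel(xi, x) for xi, y in zip(xs, ys))
    den = sum(kernel(xi, x) for xi in xs)
    # Assumption: predict 0 when no observation receives positive weight.
    return num / den if den > 0 else 0.0


def box_kernel(xi, x, h=0.25):
    """Hypothetical toy kernel: 1 if xi lies within h of x in every coordinate."""
    return 1.0 if all(abs(a - b) <= h for a, b in zip(xi, x)) else 0.0


xs = [(0.1,), (0.2,), (0.9,)]
ys = [1.0, 3.0, 10.0]
estimate = kernel_estimate((0.15,), xs, ys, box_kernel)
print(estimate)  # average of the responses of the two points near 0.15
```

Any function of two points can be passed as `kernel`; nothing in the sketch requires the homogeneous form $K((x - X_i)/h)$, which is the point made in the paragraph above.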

Breiman (2000) was the first to notice the link between forest and kernel
methods, a link which was later formalized by Geurts et al. (2006). On the
practical side, Davies and Ghahramani (2014) highlight the fact that a specific
kernel based on random forests can empirically outperform state-of-the-art
kernel methods. Another approach is taken by Lin and Jeon (2006), who establish
the connection between random forests and adaptive nearest neighbor, implying
that random forests can be seen as adaptive kernel estimates (see also Biau and
Devroye, 2010). The latest study is by Arlot and Genuer (2014), who show that
a specific random forest can be written as a kernel estimate and who exhibit
rates of consistency. However, despite these works, the literature is relatively
sparse regarding the link between forests and kernel methods.

Our objective in the present paper is to prove that a slight modification of
random forest procedures has an explicit and simple interpretation in terms of
kernel methods. Thus, the resulting kernel based on random forest (called KeRF
in the rest of the paper) estimates are more amenable to mathematical analysis.
They also appear to be empirically as accurate as random forest estimates. To
theoretically support these results, we also make explicit the expression of some
KeRF. We prove upper bounds on their rates of consistency, which compare
favorably to the existing ones.

The paper is organized as follows. Section 2 is devoted to notations and to the 
definition of KeRF estimates. The link between KeRF estimates and random 
forest estimates is made explicit in Section 3. In Section 4, two KeRF estimates 
are presented and their consistency is proved along with their rate of consistency. 
Section 5 contains experiments that highlight the good performance of KeRF 
compared to their random forest counterparts. Proofs are postponed to
Section 6.


2 Notations and first definitions

2.1 Notations 

Throughout the paper, we assume to be given a training sample
$\mathcal{D}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ of
$[0,1]^d \times \mathbb{R}$-valued independent random variables distributed
as the independent prototype pair $(X, Y)$, where $\mathbb{E}[Y^2] < \infty$.
We aim at predicting the response $Y$, associated with the random variable $X$,
by estimating the regression function $m(x) = \mathbb{E}[Y \mid X = x]$. In this
context, we use infinite random forests (see the definition below) to build an
estimate $m_{\infty,n} : [0,1]^d \to \mathbb{R}$ of $m$, based on the data set
$\mathcal{D}_n$.

A random forest is a collection of $M$ randomized regression trees (for an
overview on tree construction, see, e.g., Chapter 20 in Györfi et al., 2002).
For the $j$-th tree in the family, the predicted value at point $x$ is denoted
by $m_n(x, \Theta_j)$, where $\Theta_1, \ldots, \Theta_M$ are independent random
variables, distributed as a generic random variable $\Theta$, independent of the
sample $\mathcal{D}_n$. This random variable can be used to sample the training
set or to select the candidate directions or positions for splitting. The trees
are combined to form the finite forest estimate

$$ m_{M,n}(x, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^M m_n(x, \Theta_j). $$

By the law of large numbers, for all $x \in [0,1]^d$, almost surely, the finite
forest estimate tends to the infinite forest estimate

$$ m_{\infty,n}(x) = \mathbb{E}_\Theta[m_n(x, \Theta)], $$

where $\mathbb{E}_\Theta$ denotes the expectation with respect to $\Theta$,
conditionally on $\mathcal{D}_n$.

As mentioned above, there is a large variety of forests, depending on how trees
are grown and how the random variable $\Theta$ influences the tree construction.
For instance, tree construction can be independent of $\mathcal{D}_n$ (Biau,
2012). On the other hand, it can depend only on the $X_i$’s (Biau et al., 2008)
or on the whole training set (Cutler and Zhao, 2001; Geurts et al., 2006; Zhu et
al., 2012). Throughout the paper, we use three important types of random forests
to exemplify our results: Breiman’s, centred and uniform forests. In Breiman’s
original procedure, splits are performed to minimize the variances within the
two resulting cells. The algorithm stops when each cell contains less than a
small pre-specified number of points (typically between 1 and 5; see Breiman,
2001, for details). Centred forests are a simpler procedure which, at each node,
uniformly selects a coordinate among $\{1, \ldots, d\}$ and performs a split at
the center of the cell along the pre-chosen coordinate. The algorithm stops when
a full binary tree of level $k$ is built (that is, each cell is cut exactly $k$
times), where $k \in \mathbb{N}$ is a parameter of the algorithm (see Breiman,
2004, for details on the procedure). Uniform forests are quite similar to
centred forests except that once a split direction is chosen, the split is drawn
uniformly on the side of the cell, along the preselected coordinate (see, e.g.,
Arlot and Genuer, 2014).
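The centred and uniform constructions described above can be sketched in a few lines of Python. The functions below follow only the branch of one tree that contains a query point `x` and return its final cell as a list of `(low, high)` intervals. The names `centred_cell` and `uniform_cell`, and the tie rule that sends a point lying exactly on a cut to the lower cell, are assumptions of this sketch.

```python
import random


def centred_cell(x, k, rng):
    """Cell containing x after k midpoint cuts along uniformly chosen coordinates."""
    cell = [(0.0, 1.0) for _ in x]
    for _ in range(k):
        j = rng.randrange(len(x))      # split direction, uniform over {0, ..., d-1}
        lo, hi = cell[j]
        mid = (lo + hi) / 2            # centred forest: cut at the middle of the side
        cell[j] = (lo, mid) if x[j] <= mid else (mid, hi)
    return cell


def uniform_cell(x, k, rng):
    """Same as centred_cell, but each cut is drawn uniformly on the side of the cell."""
    cell = [(0.0, 1.0) for _ in x]
    for _ in range(k):
        j = rng.randrange(len(x))
        lo, hi = cell[j]
        cut = rng.uniform(lo, hi)      # uniform forest: random cut position
        cell[j] = (lo, cut) if x[j] <= cut else (cut, hi)
    return cell


rng = random.Random(0)
x = (0.3, 0.7)
c_cell = centred_cell(x, 4, rng)
u_cell = uniform_cell(x, 4, rng)

volume = 1.0
for lo, hi in c_cell:
    volume *= hi - lo
print(c_cell, volume)  # a centred cell of level k always has volume 2**-k
```

Neither function looks at the training data, which is what makes these two forests non-adaptive.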
2.2 Kernel based on random forests (KeRF)

To be more specific, random forest estimates satisfy, for all $x \in [0,1]^d$,

$$ m_{M,n}(x, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^M \left( \sum_{i=1}^n \frac{Y_i 1_{X_i \in A_n(x, \Theta_j)}}{N_n(x, \Theta_j)} \right), $$

where $A_n(x, \Theta_j)$ is the cell containing $x$, designed with randomness
$\Theta_j$ and data set $\mathcal{D}_n$, and

$$ N_n(x, \Theta_j) = \sum_{i=1}^n 1_{X_i \in A_n(x, \Theta_j)} $$

is the number of data points falling in $A_n(x, \Theta_j)$. Note that the
weights $W_{i,j,n}(x)$ of each observation $Y_i$, defined by

$$ W_{i,j,n}(x) = \frac{1_{X_i \in A_n(x, \Theta_j)}}{N_n(x, \Theta_j)}, $$

depend on the number of observations $N_n(x, \Theta_j)$. Thus the contributions
of observations that are in cells with a high density of data points are smaller
than those of observations which belong to less populated cells. This is
particularly true for non-adaptive forests (i.e., forests built independently of
data) since the number of observations in each cell cannot be controlled. Giving
important weights to observations that are in low-density cells can potentially
lead to rough estimates. Indeed, as an extreme example, trees of non-adaptive
forests can contain empty cells, which leads to a substantial misestimation
(since the prediction in empty cells is set, by default, to zero).

In order to improve the random forest methods and compensate for the
misestimation induced by random forest weights, a natural idea is to consider
KeRF estimates $\tilde m_{M,n}$ defined, for all $x \in [0,1]^d$, by

$$ \tilde m_{M,n}(x, \Theta_1, \ldots, \Theta_M) = \frac{1}{\sum_{j=1}^M N_n(x, \Theta_j)} \sum_{j=1}^M \sum_{i=1}^n Y_i 1_{X_i \in A_n(x, \Theta_j)}. \tag{3} $$

Note that $\tilde m_{M,n}(x, \Theta_1, \ldots, \Theta_M)$ is equal to the mean
of the $Y_i$’s falling in the cells containing $x$ in the forest. Thus, each
observation is weighted by the number of times it appears in the trees of the
forest. Consequently, in this setting, an empty cell does not contribute to the
prediction.
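To make the difference between the two weightings concrete, here is a small Python sketch that computes both estimates from a membership table. The table `in_cell[j][i]`, which is true when observation i falls in the cell of tree j containing the query point, is a hypothetical input format chosen for this illustration; returning 0 when every cell is empty is likewise an assumption.

```python
def forest_estimate(ys, in_cell):
    """Random forest estimate: average over trees of the within-cell mean of Y."""
    total = 0.0
    for row in in_cell:                      # one row per tree
        count = sum(row)
        if count > 0:                        # an empty cell predicts 0 by default
            total += sum(y for y, inside in zip(ys, row) if inside) / count
    return total / len(in_cell)


def kerf_estimate(ys, in_cell):
    """KeRF estimate (3): mean of all Y_i's pooled over the cells containing x."""
    count = sum(sum(row) for row in in_cell)
    if count == 0:
        return 0.0                           # assumption: no connected observation
    pooled = sum(y for row in in_cell for y, inside in zip(ys, row) if inside)
    return pooled / count


ys = [1.0, 2.0, 6.0]
in_cell = [
    [True, True, False],    # tree 1: cell of x holds observations 1 and 2
    [False, False, True],   # tree 2: cell of x holds observation 3
    [False, False, False],  # tree 3: cell of x is empty
]
print(forest_estimate(ys, in_cell))  # (1.5 + 6 + 0) / 3
print(kerf_estimate(ys, in_cell))    # (1 + 2 + 6) / 3
```

In this toy table the empty cell of the third tree drags the forest estimate towards zero, whereas the pooled KeRF mean ignores it.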

The proximity between KeRF estimates $\tilde m_{M,n}$ and random forest
estimates will be thoroughly discussed in Section 3. For now, we focus on (3)
and start by proving that it is indeed a kernel estimate whose expression is
given by Proposition 1.

Proposition 1. Almost surely, for all $x \in [0,1]^d$, we have

$$ \tilde m_{M,n}(x, \Theta_1, \ldots, \Theta_M) = \frac{\sum_{i=1}^n Y_i K_{M,n}(x, X_i)}{\sum_{\ell=1}^n K_{M,n}(x, X_\ell)}, \tag{4} $$

where

$$ K_{M,n}(x, z) = \frac{1}{M} \sum_{j=1}^M 1_{z \in A_n(x, \Theta_j)}. \tag{5} $$

We call $K_{M,n}$ the connection function of the $M$ finite forest.

Proposition 1 states that KeRF estimates have a more interpretable form than
random forest estimates since their kernels are the connection functions of the
forests. This connection function can be seen as a geometrical characteristic
of the cells in the random forest. Indeed, fixing $X_i$, the quantity
$K_{M,n}(x, X_i)$ is nothing but the empirical probability that $X_i$ and $x$
are connected (i.e., in the same cell) in the $M$ finite random forest. Thus,
the connection function is a natural way to build kernel functions from random
forests, a fact that had already been noticed by Breiman (2001). Note that these
kernel functions have the nice property of being positive semi-definite, as
proved by Davies and Ghahramani (2014).
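The identity in Proposition 1 can be checked numerically on a toy example. The sketch below computes the connection function (5) from a hypothetical membership table (`in_cell[j][i]` is true when observation i shares the cell of the query point in tree j), plugs it into the kernel form (4), and compares the result with the pooled mean (3). All names are illustrative.

```python
def connection_function(in_cell):
    """K_{M,n}(x, X_i) for each i: fraction of the M trees connecting X_i to x."""
    m = len(in_cell)
    n = len(in_cell[0])
    return [sum(row[i] for row in in_cell) / m for i in range(n)]


def kernel_form(ys, k_values):
    """Right-hand side of (4) for precomputed kernel values K_{M,n}(x, X_i)."""
    den = sum(k_values)
    return sum(y * k for y, k in zip(ys, k_values)) / den if den > 0 else 0.0


ys = [1.0, 2.0, 6.0]
in_cell = [
    [True, True, False],
    [True, False, False],
    [False, False, False],
    [True, False, True],
]

k_values = connection_function(in_cell)          # [3/4, 1/4, 1/4]
via_kernel = kernel_form(ys, k_values)

# Pooled mean of definition (3), computed directly for comparison.
count = sum(sum(row) for row in in_cell)
pooled = sum(y for row in in_cell for y, inside in zip(ys, row) if inside) / count

print(k_values, via_kernel, pooled)
```

The factor $1/M$ cancels between numerator and denominator, which is why the two computations agree.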

A natural question is to ask what happens to KeRF estimates when the number
of trees $M$ goes to infinity. To this aim, we define infinite KeRF estimates
$\tilde m_{\infty,n}$ by, for all $x$,

$$ \tilde m_{\infty,n}(x) = \lim_{M \to \infty} \tilde m_{M,n}(x, \Theta_1, \ldots, \Theta_M). \tag{6} $$

In addition, we say that an infinite random forest is discrete (resp.
continuous) if its connection function $K_n$ is piecewise constant (resp.
continuous). For example, Breiman forests and centred forests are discrete but
uniform forests are continuous. Denote by $\mathbb{P}_\Theta$ the probability
with respect to $\Theta$, conditionally on $\mathcal{D}_n$. Proposition 2
extends the results of Proposition 1 to the case of infinite KeRF estimates.

Proposition 2. Consider an infinite discrete or continuous forest. Then,
almost surely, for all $x, z \in [0,1]^d$,

$$ \lim_{M \to \infty} K_{M,n}(x, z) = K_n(x, z), $$

where

$$ K_n(x, z) = \mathbb{P}_\Theta[z \in A_n(x, \Theta)]. $$

We call $K_n$ the connection function of the infinite random forest. Thus, for
all $x \in [0,1]^d$, one has

$$ \tilde m_{\infty,n}(x) = \frac{\sum_{i=1}^n Y_i K_n(x, X_i)}{\sum_{\ell=1}^n K_n(x, X_\ell)}. $$

This proposition shows that infinite KeRF estimates are kernel estimates with
kernel function equal to $K_n$. Since $K_n(x, z)$ is the probability that $x$
and $z$ are connected in the infinite forest, the function $K_n$ characterizes
the shape of the cells in the infinite random forest.

Now that we know the expression of KeRF estimates, we are ready to study
how close this approximation is to random forest estimates. This link will be
further worked out in Section 4 for centred and uniform KeRF and empirically
studied in Section 5.





3 Relation between KeRF and random forests 


In this section, we investigate in which cases KeRF and forest estimates are
close to each other. To achieve this goal, we will need the following assumption.

(H1) Fix $x \in [0,1]^d$, and assume that $Y \ge 0$ a.s. Then, one of the
following two conditions holds:

(H1.1) There exist sequences $(a_n)$, $(b_n)$ such that, a.s.,

$$ a_n \le N_n(x, \Theta) \le b_n. $$

(H1.2) There exist sequences $(\varepsilon_n)$, $(a_n)$, $(b_n)$ such that, a.s.,

• $1 \le a_n \le \mathbb{E}_\Theta[N_n(x, \Theta)] \le b_n$,

• $\mathbb{P}_\Theta\left[ a_n \le N_n(x, \Theta) \le b_n \right] \ge 1 - \varepsilon_n$.

(H1) assumes that the number of points in every cell of the forest can be
bounded from above and below. (H1.1) holds for finite forests for which the
number of points in each cell is controlled almost surely. Typically, (H1.1) is
verified for adaptive random forests, if the stopping rule is properly chosen.
On the other hand, (H1.2) holds for infinite forests. Note that the first
condition $\mathbb{E}_\Theta[N_n(x, \Theta)] \ge 1$ in (H1.2) is technical and
is true if the level of each tree is tuned appropriately. Several random forests
which satisfy (H1) are discussed below.

Proposition 3 states that the finite forest estimate $m_{M,n}$ and the finite
KeRF estimate $\tilde m_{M,n}$ are close to each other assuming that (H1.1)
holds.

Proposition 3. Assume that (H1.1) is satisfied. Then, almost surely,

$$ \left| \frac{m_{M,n}(x, \Theta_1, \ldots, \Theta_M)}{\tilde m_{M,n}(x, \Theta_1, \ldots, \Theta_M)} - 1 \right| \le \frac{b_n - a_n}{a_n}, $$

with the convention that 0/0 = 1.

Since KeRF estimates are kernel estimates of the form (1), Proposition 3 stresses 
that random forests are close to kernel estimates if the number of points in each 
cell is controlled. As highlighted by the following discussion, the assumptions 
of Proposition 3 are satisfied for some types of random forests. 


Centred random forests of level k. For this model, whenever $X$ is uniformly
distributed over $[0,1]^d$, each cell has a Lebesgue measure of $2^{-k}$. Thus,
fixing $x \in [0,1]^d$, according to the law of the iterated logarithm, for all
$n$ large enough, almost surely,

$$ \left| N_n(x, \Theta) - \frac{n}{2^k} \right| \le \frac{\sqrt{2n \log\log n}}{2}. $$

Consequently, (H1.1) is satisfied for $a_n = n2^{-k} - \sqrt{2n \log\log n}/2$
and $b_n = n2^{-k} + \sqrt{2n \log\log n}/2$. This yields, according to
Proposition 3, almost surely,

$$ \left| \frac{m_{M,n}(x, \Theta_1, \ldots, \Theta_M)}{\tilde m_{M,n}(x, \Theta_1, \ldots, \Theta_M)} - 1 \right| \le \frac{\sqrt{2n \log\log n}}{n2^{-k} - \sqrt{2n \log\log n}/2}. $$

Thus, choosing for example $k = (\log_2 n)/3$, centred KeRF estimates are
asymptotically equivalent to centred forest estimates as $n \to \infty$. The
previous inequality can be extended to the case where $X$ has a density $f$
satisfying $c \le f \le C$, for some constants $0 < c < C < \infty$. In that
case, almost surely,

$$ \left| \frac{m_{M,n}(x, \Theta_1, \ldots, \Theta_M)}{\tilde m_{M,n}(x, \Theta_1, \ldots, \Theta_M)} - 1 \right| \le \frac{\sqrt{2n \log\log n} + (C - c)n/2^k}{nc2^{-k} - \sqrt{2n \log\log n}/2}. $$

However, the right-hand term does not tend to zero as $n \to \infty$, meaning
that the uniform assumption on $X$ is crucial to prove the asymptotic
equivalence of $m_{M,n}$ and $\tilde m_{M,n}$ in the case of centred forests.
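A quick numerical look at the two bounds for centred forests illustrates the contrast. The Python sketch below evaluates the right-hand sides with $a_n = n2^{-k} - \sqrt{2n \log\log n}/2$ and $b_n = n2^{-k} + \sqrt{2n \log\log n}/2$, taking $k = (\log_2 n)/3$ as a real number for simplicity (an assumption of this sketch, since $k$ is an integer in the paper). The density bounds `c = 0.5` and `C = 1.5` are arbitrary illustrative values.

```python
import math


def centred_ratio_bound(n, k):
    """(b_n - a_n) / a_n for a centred forest of level k with uniform X."""
    dev = math.sqrt(2 * n * math.log(math.log(n)))
    a_n = n * 2.0 ** (-k) - dev / 2
    return dev / a_n if a_n > 0 else math.inf


def centred_ratio_bound_density(n, k, c, C):
    """Analogous bound when X has a density f with c <= f <= C."""
    dev = math.sqrt(2 * n * math.log(math.log(n)))
    denom = c * n * 2.0 ** (-k) - dev / 2
    return (dev + (C - c) * n * 2.0 ** (-k)) / denom if denom > 0 else math.inf


sizes = [10**4, 10**6, 10**8, 10**10]
bounds = [centred_ratio_bound(n, math.log2(n) / 3) for n in sizes]
density_bounds = [
    centred_ratio_bound_density(n, math.log2(n) / 3, 0.5, 1.5) for n in sizes
]

for n, b, db in zip(sizes, bounds, density_bounds):
    print(f"n={n:>12}  uniform: {b:.4f}  density: {db:.4f}")
```

With $k = (\log_2 n)/3$ the first bound behaves like $\sqrt{2 \log\log n}\, n^{-1/6}$ and shrinks slowly, while the second stays above $(C - c)/c$ whatever the sample size.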


Breiman’s forests. Each leaf in Breiman’s trees contains a small number of
points (typically between 1 and 5). Thus, if each cell contains exactly one point
(default settings in classification problems), (H1.1) holds with
$a_n = b_n = 1$. Thus, according to Proposition 3, almost surely,

$$ m_{M,n}(x, \Theta_1, \ldots, \Theta_M) = \tilde m_{M,n}(x, \Theta_1, \ldots, \Theta_M). $$

More generally, if the number of observations in each cell varies between 1 and
5, one can set $a_n = 1$ and $b_n = 5$. Thus, still by Proposition 3, almost
surely,

$$ \left| \frac{m_{M,n}(x, \Theta_1, \ldots, \Theta_M)}{\tilde m_{M,n}(x, \Theta_1, \ldots, \Theta_M)} - 1 \right| \le 4. $$


Median forests of level k. In this model, each cell of each tree is split at
the empirical median of the observations belonging to the cell. The process is
repeated until every cell is cut exactly $k$ times (where $k \in \mathbb{N}$ is
a parameter chosen by the user). Thus, each cell contains the same number of
points $\pm 2$ (see, e.g., Biau and Devroye, 2013, for details), and, according
to Proposition 3, almost surely,

$$ \left| \frac{m_{M,n}(x, \Theta_1, \ldots, \Theta_M)}{\tilde m_{M,n}(x, \Theta_1, \ldots, \Theta_M)} - 1 \right| \le \frac{2}{a_n}. $$

Consequently, if the level $k$ of each tree is chosen such that
$a_n \to \infty$ as $n \to \infty$, median KeRF estimates are equivalent to
median forest estimates.

The following proposition extends Proposition 3 to infinite KeRF and forest
estimates.

Proposition 4. Assume that (H1.2) is satisfied. Then, almost surely,

$$ \left| m_{\infty,n}(x) - \tilde m_{\infty,n}(x) \right| \le \frac{b_n - a_n}{a_n} \tilde m_{\infty,n}(x) + n \varepsilon_n \left( \max_{1 \le i \le n} Y_i \right). $$

Considering the inequality provided in Proposition 4, we see that infinite KeRF
estimates are close to infinite random forest estimates if the number of
observations in each cell is bounded (via $a_n$ and $b_n$).

It is worth noticing that controlling the number of observations in each cell
while obtaining a simple partition shape is difficult to achieve. On the one
hand, if the tree construction depends on the training set, the algorithm can be
stopped when each leaf contains exactly one point, and thus the KeRF estimate is
equal to the random forest estimate. However, in that case, the probability
$K_n(x, z)$ is very difficult to express since the geometry of each tree
partitioning strongly depends on the training set. On the other hand, if the
tree construction is independent of the training set, the probability
$K_n(x, z)$ can be made explicit in some cases, for example for centred forests
(see Section 4). However, the number of points in each cell is difficult to
control (every leaf cannot contain exactly one point with a non-adaptive cutting
strategy), and thus the KeRF estimate can be far away from the random forest
estimate. Consequently, one cannot deduce an explicit expression for random
forest estimates from the explicit expression of KeRF estimates.


4 Two particular KeRF estimates 


According to Proposition 2, the infinite KeRF estimate $\tilde m_{\infty,n}$
depends only on the connection function $K_n$ via the following equation

$$ \tilde m_{\infty,n}(x) = \frac{\sum_{i=1}^n Y_i K_n(x, X_i)}{\sum_{\ell=1}^n K_n(x, X_\ell)}. \tag{7} $$

To take one step further into the understanding of KeRF, we study in this 
section the connection function of two specific infinite random forests. We focus 
on infinite KeRF estimates for two reasons. Firstly, the expressions of infinite 
KeRF estimates are more amenable to mathematical analysis since they do not 
depend on the particular trees used to build the forest. Secondly, the prediction 
accuracy of infinite random forests is known to be better than that of finite 
random forests (see, e.g., Scornet, 2014). Therefore infinite KeRF estimates are 
likely to be more accurate than finite KeRF estimates. 


Practically, both infinite KeRF estimates and infinite random forest estimates
can only be approximated by Monte Carlo simulations. Here, we show that
centred KeRF estimates have an explicit expression, that is, their connection
function can be made explicit. Thus, infinite centred KeRF estimates and
infinite uniform KeRF estimates (up to an approximation detailed below) can be
directly computed using equation (7).


Centred KeRF. As seen above, the construction of centred KeRF of level $k$
is the same as for centred forests of level $k$ except that predictions are made
according to equation (3). Centred random forests are closely related to
Breiman’s forests in a linear regression framework. Indeed, in this context,
splits that are performed at a low level of the trees are roughly located at the
middle of each cell. In that case, Breiman’s forests and centred forests are
close to each other, which justifies the interest for these simplified models,
and thus for centred KeRF.

In the sequel, the connection function of the centred random forest of level $k$
is denoted by $K_k^{cc}$. This notation is justified by the fact that the
construction of centred KeRF estimates depends only on the size of the training
set through the choice of $k$.


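Proposition 5. Let $k \in \mathbb{N}$ and consider an infinite centred random
forest of level $k$. Then, for all $x, z \in [0,1]^d$,

$$ K_k^{cc}(x, z) = \sum_{\substack{k_1, \ldots, k_d \\ \sum_{\ell=1}^d k_\ell = k}} \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^k \prod_{j=1}^d 1_{\lceil 2^{k_j} x_j \rceil = \lceil 2^{k_j} z_j \rceil}. $$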


Note that ties are broken by imposing that cells are of the form
$\prod_{i=1}^d A_i$, where the $A_i$ are equal to $]a_i, b_i]$ or $[0, b_i]$,
for all $0 < a_i < b_i \le 1$. Figure 1 shows a graphical representation of the
function $f_k$ defined as

$$ f_k : [0,1] \times [0,1] \to [0,1], \qquad z = (z_1, z_2) \mapsto K_k^{cc}\left( \left( \tfrac{1}{2}, \tfrac{1}{2} \right), z \right). $$

Figure 1: Representations of $f_1$, $f_2$ and $f_5$ in $[0,1]^2$
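The explicit formula of Proposition 5 can be evaluated directly by enumerating the tuples $(k_1, \ldots, k_d)$ that sum to $k$. The Python sketch below does this under the reading that the indicator compares $\lceil 2^{k_j} x_j \rceil$ with $\lceil 2^{k_j} z_j \rceil$; the function names are illustrative, and coordinates are assumed to lie in $(0, 1]$ so that the ceiling matches cells of the form $]a, b]$.

```python
import math


def compositions(k, d):
    """Yield all tuples (k_1, ..., k_d) of non-negative integers summing to k."""
    if d == 1:
        yield (k,)
        return
    for first in range(k + 1):
        for rest in compositions(k - first, d - 1):
            yield (first,) + rest


def centred_kernel(x, z, k):
    """Connection function of the infinite centred forest of level k."""
    d = len(x)
    total = 0
    for ks in compositions(k, d):
        # x and z share a cell iff they fall in the same dyadic interval
        # of length 2**-k_j along every coordinate j.
        connected = all(
            math.ceil(2**kj * xj) == math.ceil(2**kj * zj)
            for kj, xj, zj in zip(ks, x, z)
        )
        if connected:
            coef = math.factorial(k)
            for kj in ks:
                coef //= math.factorial(kj)   # multinomial coefficient k!/(k_1!...k_d!)
            total += coef
    return total / d**k


print(centred_kernel((0.25, 0.25), (0.75, 0.25), 1))  # connected iff the cut is along axis 2
print(centred_kernel((0.3, 0.6), (0.3, 0.6), 3))      # a point is always connected to itself
```

The loop visits one term per composition of $k$ into $d$ parts, so the cost of a single kernel evaluation grows quickly with both $k$ and $d$.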


Denote by $\tilde m_{\infty,n}^{cc}$ the infinite centred KeRF estimate,
associated with the connection function $K_k^{cc}$, defined as

$$ \tilde m_{\infty,n}^{cc}(x) = \frac{\sum_{i=1}^n Y_i K_k^{cc}(x, X_i)}{\sum_{\ell=1}^n K_k^{cc}(x, X_\ell)}. $$

To pursue the analysis of $\tilde m_{\infty,n}^{cc}$, we will need the following
assumption on the regression model.

(H2) One has

$$ Y = m(X) + \varepsilon, $$

where $\varepsilon$ is a centred Gaussian noise, independent of $X$, with finite
variance $\sigma^2 < \infty$. Moreover, $X$ is uniformly distributed on
$[0,1]^d$ and $m$ is Lipschitz.

Our theorem states that infinite centred KeRF estimates are consistent whenever
(H2) holds. Moreover, it provides an upper bound on the rate of consistency
of centred KeRF.

Theorem 4.1. Assume that (H2) is satisfied. Then, providing $k \to \infty$ and
$n/2^k \to \infty$, there exists a constant $C_1 > 0$ such that, for all
$n > 1$, and for all $x \in [0,1]^d$,

$$ \mathbb{E}\left[ \tilde m_{\infty,n}^{cc}(x) - m(x) \right]^2 \le C_1 n^{-1/(3 + d \log 2)} (\log n)^2. $$

Observe that centred KeRF estimates fail to reach the minimax rate of
consistency $n^{-2/(d+2)}$ over the class of Lipschitz functions. A similar
upper bound on the rate of consistency, $n^{-3/(4d \log 2 + 3)}$, of centred
random forests was obtained by Biau (2012). It is worth noticing that, for all
$d \ge 9$, the upper bound on the rate of centred KeRF is sharper than that of
centred random forests. This theoretical result supports the fact that the KeRF
procedure has a better performance compared to centred random forests. This will
be supported by simulations in Section 5 (see Figure 5).
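The dimension threshold quoted above can be recovered by comparing exponents. The short Python sketch below assumes the exponents read as $1/(3 + d \log 2)$ for centred KeRF (Theorem 4.1), $3/(4d \log 2 + 3)$ for centred forests (Biau, 2012) and $2/(d + 2)$ for the minimax rate, with $\log$ the natural logarithm; it ignores the $(\log n)^2$ factor. The function names are illustrative.

```python
import math


def kerf_exponent(d):
    """Exponent of n in the centred KeRF bound: 1 / (3 + d log 2)."""
    return 1.0 / (3.0 + d * math.log(2))


def forest_exponent(d):
    """Exponent of n in the centred forest bound: 3 / (4 d log 2 + 3)."""
    return 3.0 / (4.0 * d * math.log(2) + 3.0)


def minimax_exponent(d):
    """Exponent of the minimax rate over Lipschitz functions: 2 / (d + 2)."""
    return 2.0 / (d + 2.0)


# Smallest dimension for which the KeRF exponent is the larger (sharper) one.
first_d = next(d for d in range(1, 101) if kerf_exponent(d) > forest_exponent(d))
print(first_d)

for d in (1, 5, 9, 20):
    print(d, kerf_exponent(d), forest_exponent(d), minimax_exponent(d))
```

Solving $3 + d \log 2 < (4d \log 2 + 3)/3$ by hand gives $d > 6/\log 2 \approx 8.66$, in agreement with the threshold $d \ge 9$.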


Uniform KeRF. Recall that the infinite uniform KeRF estimates of level $k$
are the same as infinite uniform forests of level $k$ except that predictions
are computed according to equation (3). Uniform random forests, first studied by
Biau et al. (2008), remain under active investigation. They are a convenient
model of Breiman forests since, with no a priori information on the split
location, we can consider that splits are drawn uniformly on the cell edges.
Other related versions of these forests have been thoroughly investigated by
Arlot and Genuer (2014), who compare the bias of a single tree to that of the
whole forest.

As for the connection function of centred random forests, we use the notational
convention $K_k^{uf}$ to denote the connection function of uniform random
forests of level $k$.

Proposition 6. Let $k \in \mathbb{N}$ and consider an infinite uniform random
forest of level $k$. Then, for all $x \in [0,1]^d$,

$$ K_k^{uf}(0, x) = \sum_{\substack{k_1, \ldots, k_d \\ \sum_{\ell=1}^d k_\ell = k}} \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^k \prod_{m=1}^d \left( 1 - x_m \sum_{j=0}^{k_m - 1} \frac{(-\ln x_m)^j}{j!} \right), $$

with the convention $\sum_{j=0}^{-1} \frac{(-\ln x_m)^j}{j!} = 0$.

Proposition 6 gives the explicit expression of $K_k^{uf}(0, x)$. Figure 2 shows
a representation of the functions $f_1$, $f_2$ and $f_5$ defined as

$$ f_k : [0,1] \times [0,1] \to [0,1], \qquad z = (z_1, z_2) \mapsto K_k^{uf}\left( 0, \left| z - \left( \tfrac{1}{2}, \tfrac{1}{2} \right) \right| \right), $$

where $|z - x| = (|z_1 - x_1|, \ldots, |z_d - x_d|)$.

Unfortunately, the general expression of the connection function
$K_k^{uf}(x, z)$ is difficult to obtain. Indeed, for $d = 1$, cuts are performed
along a single axis, but the probability of connection between two points $x$
and $z$ does not depend only upon the distance $|z - x|$ but rather on the
positions $x$ and $z$, as stressed in the following lemma.

Figure 2: Representations of $f_1$, $f_2$ and $f_5$ in dimension two


Lemma 1. Let $x, z \in [0,1]$. Then,

$$ K_1^{uf}(x, z) = 1 - |z - x|, $$

$$ K_2^{uf}(x, z) = 1 - |z - x| + |z - x| \log(\ldots). $$
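In dimension one, the factor appearing in Proposition 6 and the first identity of Lemma 1 can be checked by simulation. The Python sketch below assumes the one-dimensional reading $K_k^{uf}(0, x) = 1 - x \sum_{j=0}^{k-1} (-\ln x)^j / j!$ and estimates connection probabilities by repeatedly cutting the cell that contains the first point. The function names, the seed and the number of trials are arbitrary choices of this sketch.

```python
import math
import random


def uniform_kernel_1d(x, k):
    """Closed form for the probability that 0 and x stay connected after k cuts."""
    if x == 0:
        return 1.0
    partial = sum((-math.log(x)) ** j / math.factorial(j) for j in range(k))
    return 1.0 - x * partial      # the empty sum (k = 0) is 0 by convention


def simulate_connection(x, z, k, trials, rng):
    """Monte Carlo estimate of the connection probability of x and z (d = 1)."""
    left, right = min(x, z), max(x, z)
    connected = 0
    for _ in range(trials):
        lo, hi = 0.0, 1.0
        together = True
        for _ in range(k):
            cut = rng.uniform(lo, hi)
            if left <= cut < right:       # the cut separates the two points
                together = False
                break
            if cut < left:
                lo = cut                  # both points lie above the cut
            else:
                hi = cut                  # both points lie below the cut
        if together:
            connected += 1
    return connected / trials


rng = random.Random(0)
mc_k2 = simulate_connection(0.0, 0.4, 2, 200_000, rng)
mc_k1 = simulate_connection(0.3, 0.5, 1, 200_000, rng)
print(mc_k2, uniform_kernel_1d(0.4, 2))   # simulation vs closed form, k = 2
print(mc_k1, 1 - abs(0.5 - 0.3))          # simulation vs Lemma 1, k = 1
```

The closed form follows from the fact that, after $k$ cuts, the right end of the cell containing 0 is a product of $k$ independent uniform variables, whose negative logarithm has a Gamma distribution with shape $k$.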


A natural way to deal with this difficulty is to replace the connection function
$K_k^{uf}$ by the function $(x, z) \mapsto K_k^{uf}(0, |z - x|)$. Indeed, this
is a simple manner to build an invariant-by-translation version of the uniform
kernel $K_k^{uf}$. The extensive simulations in Section 5 support the fact that
estimates of the form (7) built with these two kernels have similar prediction
accuracy. As for infinite centred KeRF estimates, we denote by
$\tilde m_{\infty,n}^{uf}$ the infinite uniform KeRF estimates but built with
the invariant-by-translation version of $K_k^{uf}$, namely

$$ \tilde m_{\infty,n}^{uf}(x) = \frac{\sum_{i=1}^n Y_i K_k^{uf}(0, |X_i - x|)}{\sum_{\ell=1}^n K_k^{uf}(0, |X_\ell - x|)}. $$


Our last theorem states the consistency of infinite uniform KeRF estimates along 
with an upper bound on their rate of consistency. 


Theorem 4.2. Assume that (H2) is satisfied. Then, providing $k \to \infty$ and
$n/2^k \to \infty$, there exists a constant $C_1 > 0$ such that, for all
$n > 1$ and for all $x \in [0,1]^d$,

$$ \mathbb{E}\left[ \tilde m_{\infty,n}^{uf}(x) - m(x) \right]^2 \le C_1 n^{-2/(6 + 3d \log 2)} (\log n)^2. $$


As for centred KeRF estimates, the rate of consistency does not reach the
minimax rate on the class of Lipschitz functions, and is actually worse than
that of centred KeRF estimates, whatever the dimension $d$ is. Besides, centred
KeRF estimates have better performance than uniform KeRF estimates, and this
will be highlighted by simulations (Section 5).

Although centred and uniform KeRF estimates are kernel estimates of the form
(1), the usual tools used to prove consistency and to find rates of consistency
of kernel methods cannot be applied here (see, e.g., Chapter 5 in Györfi et al.,
2002). Indeed, the support of $z \mapsto K_k^{cc}(x, z)$ and that of
$z \mapsto K_k^{uf}(0, |z - x|)$ cannot be contained in a ball centred on $x$,
whose diameter tends to zero (see Figures 1 and 2). The proofs of Theorems 4.1
and 4.2 are then based on the previous work of Greblicki et al. (1984), who
proved the consistency of kernels with unbounded support. In particular, we use
their bias/variance decomposition of kernel estimates to exhibit upper bounds on
the rate of consistency.


5 Experiments


Practically speaking, Breiman’s random forests are among the most widely used
forest algorithms. Thus a natural question is to know whether Breiman KeRF
compare favourably to Breiman’s forests. In fact, as seen above, the two
algorithms coincide whenever Breiman’s forests are fully grown. But this is not
always the case since, by default, each cell of Breiman’s forests contains
between 1 and 5 observations.

We start this section by comparing Breiman KeRF and Breiman’s forest estimates
for various regression models described below. Some of these models are
toy models (Models 1 and 5-8). Model 2 can be found in van der Laan et al.
(2007) and Models 3-4 are presented in Meier et al. (2009). For all regression
frameworks, we consider covariates $X = (X_1, \ldots, X_d)$ that are uniformly
distributed over $[0,1]^d$. We also let $\tilde X_i = 2(X_i - 0.5)$ for
$1 \le i \le d$.


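Model 1: $n = 800$, $d = 50$, $Y = \tilde X_1^2 + \exp(-\tilde X_2^2)$

Model 2: $n = 600$, $d = 100$, $Y = \tilde X_1 \tilde X_2 + \tilde X_3^2 - \tilde X_4 \tilde X_7 + \tilde X_8 \tilde X_{10} - \tilde X_6^2 + \mathcal{N}(0, 0.5)$

Model 3: $n = 600$, $d = 100$, $Y = -\sin(2 \tilde X_1) + \tilde X_2^2 + \tilde X_3 - \exp(-\tilde X_4) + \mathcal{N}(0, 0.5)$

Model 4: $n = 600$, $d = 100$, $Y = \tilde X_1 + (2 \tilde X_2 - 1)^2 + \sin(2\pi \tilde X_3)/(2 - \sin(2\pi \tilde X_3)) + \sin(2\pi \tilde X_4) + 2\cos(2\pi \tilde X_4) + 3\sin^2(2\pi \tilde X_4) + 4\cos^2(2\pi \tilde X_4) + \mathcal{N}(0, 0.5)$

Model 5: $n = 700$, $d = 20$, $Y = 1_{\tilde X_1 > 0} + \tilde X_2^3 + 1_{\tilde X_4 + \tilde X_6 - \tilde X_8 - \tilde X_9 > 1 + \tilde X_{10}} + \exp(-\tilde X_2^2) + \mathcal{N}(0, 0.5)$

Model 6: $n = 500$, $d = 30$, $Y = \sum_{k=1}^{10} 1_{\tilde X_k^3 < 0} - 1_{\mathcal{N}(0,1) > 1.25}$

Model 7: $n = 600$, $d = 300$, $Y = \tilde X_1^2 + \tilde X_2^2 \tilde X_3 \exp(-|\tilde X_4|) + \tilde X_6 - \tilde X_8 + \mathcal{N}(0, 0.5)$

Model 8: $n = 500$, $d = 1000$, $Y = \tilde X_1 + 3 \tilde X_3^2 - 2\exp(-\tilde X_5) + \tilde X_6$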

All numerical implementations have been performed using the free Python
software, available online at https://www.python.org/. For each experiment, the
data set is divided into a training set (80% of the data set) and a test set (the
remaining 20%). Then, the empirical risk ($L^2$ error) is evaluated on the test
set.
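The evaluation protocol just described can be sketched as follows. This is only an illustration of an 80/20 split followed by a mean squared error on the held-out part; the helper names, the fixed seed and the toy data are assumptions and do not reproduce the paper's experiments.

```python
import random


def train_test_split(xs, ys, train_fraction=0.8, seed=0):
    """Shuffle the indices and split the sample into training and test parts."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_fraction * len(indices)))
    train, test = indices[:n_train], indices[n_train:]
    return (
        [xs[i] for i in train], [ys[i] for i in train],
        [xs[i] for i in test], [ys[i] for i in test],
    )


def empirical_risk(predict, xs_test, ys_test):
    """Empirical L2 risk: mean squared prediction error on the test set."""
    errors = ((predict(x) - y) ** 2 for x, y in zip(xs_test, ys_test))
    return sum(errors) / len(ys_test)


# Toy data: a noiseless linear response in dimension one.
xs = [(i / 10,) for i in range(10)]
ys = [2 * x[0] for x in xs]
x_tr, y_tr, x_te, y_te = train_test_split(xs, ys)

mean_y = sum(y_tr) / len(y_tr)
risk_constant = empirical_risk(lambda x: mean_y, x_te, y_te)   # predicts the training mean
risk_oracle = empirical_risk(lambda x: 2 * x[0], x_te, y_te)   # knows the true function
print(len(x_tr), len(x_te), risk_constant, risk_oracle)
```

Any estimator exposing a `predict(x)` callable, for instance a forest or a KeRF estimate fitted on the training part, can be scored in the same way.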

To start with, Figure 3 depicts the empirical risk of Breiman’s forests and
Breiman KeRF estimates for two regression models (the conclusions are similar
for the remaining regression models). Default settings were used for Breiman’s
forests (min_samples_split = 2, max_features = 0.333) and for Breiman KeRF,
except that we did not bootstrap the data set. Figure 3 shows that Breiman KeRF
estimates behave similarly (in terms of empirical risk) to Breiman forest
estimates. It is also interesting to note that bootstrapping the data set does
not change the performance of the two algorithms.

Figure 3: Empirical risks of Breiman KeRF estimates and Breiman forest
estimates (panels: Model 1, Model 2).

Figure 4 (resp. Figure 5) shows the risk of uniform (resp. centred) KeRF
estimates compared to the risk of uniform (resp. centred) forest estimates (only
two models shown). In these two experiments, uniform and centred forests and
their KeRF counterparts have been grown in such a way that each tree is a
complete binary tree of level $k = \lfloor \log_2 n \rfloor$. Thus, in that case,
each cell contains on average $n/2^k \simeq 1$ observation. Once again, the main
message of Figure 4 is that the uniform KeRF accuracy is close to the uniform
forest accuracy.


Figure 4: Empirical risks of uniform KeRF and uniform forest (panels: Model 1,
Model 2; horizontal axis: number of trees).


On the other hand, it turns out that the performances of centred KeRF and
centred forests are not similar (Figure 5). In fact, centred KeRF estimates are
either comparable to centred forest estimates (as, for example, in Model 2), or
have a better accuracy (as, for example, in Model 1). A possible explanation
for this phenomenon is that centred forests are non-adaptive in the sense that
their construction does not depend on the data set. Therefore, each tree is
likely to contain cells with unbalanced numbers of data points, which can result
in random forest misestimation. This undesirable effect vanishes using KeRF
methods since they assign the same weights to each observation.

The same series of experiments was conducted, but using bootstrap for computing
both KeRF and random forest estimates. The general finding is that the results
are similar: Figures 6 and 7 depict the accuracy of the corresponding algorithms
for a selected choice of regression frameworks.

Figure 5: Empirical risks of centred KeRF and centred forest.

Figure 6: Empirical risks of uniform KeRF and uniform forest (with bootstrap).

Figure 7: Empirical risks of centred KeRF and centred forests (with bootstrap;
panels: Model 1, Model 2).

An important aspect of infinite centred and uniform KeRF is that they can be
explicitly computed (see Propositions 5 and 6). Thus, we have plotted in Figure
8 the empirical risk of both finite and infinite centred KeRF estimates for some
examples (for $n = 100$ and $d = 10$). We clearly see in this figure that the
accuracy of finite centred KeRF tends to the accuracy of infinite centred KeRF
as $M$ tends to infinity. This corroborates Proposition 2.

The same comments hold for uniform KeRF (see Figure 9). Note however that,
in that case, the proximity between finite uniform KeRF and infinite uniform
KeRF estimates supports the approximation that has been made on infinite
uniform KeRF in Section 4.

Figure 8: Risks of finite and infinite centred KeRF.

Figure 9: Risks of finite and infinite uniform KeRF.

The computation time of finite KeRF estimates is very acceptable and similar
to that of random forests (Figures 3-5). However, the story
is different for infinite KeRF estimates. In fact, infinite KeRF estimates can only be
evaluated for low dimensional data sets and small sample sizes. To see this, just
note that the explicit formulation of KeRF involves a multinomial distribution
(Propositions 5 and 6). Each evaluation of the multinomial creates a computational
burden when the dimensions (d and n) of the problem increase. For
example, in Figures 8 and 9, the computation time needed to compute infinite
KeRF estimates ranges from thirty minutes to three hours. As a matter of fact,
infinite KeRF methods should be seen as theoretical tools rather than a practical
substitute for random forests.
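To give a rough idea of this burden, the sketch below counts the terms of the multinomial sum, that is, the number of tuples $(k_1, \ldots, k_d)$ of non-negative integers with $k_1 + \cdots + k_d = k$. The function name is an illustrative assumption, and the level $k = \lfloor \log_2 n \rfloor$ is assumed to be the one used in the experiments above.

```python
from math import comb


def n_multinomial_terms(k: int, d: int) -> int:
    """Number of (k_1, ..., k_d), k_j >= 0, with k_1 + ... + k_d = k."""
    # Classical stars-and-bars count.
    return comb(k + d - 1, d - 1)


# With n = 100 and d = 10, the assumed level is k = floor(log2(100)) = 6.
terms_per_kernel = n_multinomial_terms(6, 10)
# One prediction needs one kernel evaluation per data point.
terms_per_prediction = 100 * terms_per_kernel
print(terms_per_kernel, terms_per_prediction)
```

The count grows polynomially in k for fixed d but quickly becomes large when both d and n increase, which is consistent with the computation times reported above.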


6 Proofs 


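Proof of Proposition 1. By definition,

\[
\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M)
= \frac{\sum_{j=1}^{M} \sum_{i=1}^{n} Y_i \mathbb{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}}
       {\sum_{j=1}^{M} \sum_{i=1}^{n} \mathbb{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}}
= \frac{M}{\sum_{j=1}^{M} \sum_{i=1}^{n} \mathbb{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}}
  \sum_{i=1}^{n} Y_i K_{M,n}(\mathbf{x}, \mathbf{X}_i).
\]

Finally, observe that

\[
\frac{1}{M} \sum_{j=1}^{M} \sum_{i=1}^{n} \mathbb{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}
= \sum_{i=1}^{n} K_{M,n}(\mathbf{x}, \mathbf{X}_i),
\]

which concludes the proof. □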
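The identity in Proposition 1 can be checked numerically on a toy example. The sketch below is an illustration only: the membership matrix `in_cell` (one row per tree, entry 1 when $\mathbf{X}_i$ falls in the cell of $\mathbf{x}$) and the responses `y` are made-up values, and the function names are assumptions.

```python
def forest_estimate(in_cell, y):
    """Classical forest: average over trees of the mean response in the cell of x."""
    tree_means = []
    for row in in_cell:                      # one row per tree
        n_j = sum(row)                       # N_n(x, Theta_j)
        cell_sum = sum(yi for yi, b in zip(y, row) if b)
        tree_means.append(cell_sum / n_j if n_j else 0.0)
    return sum(tree_means) / len(in_cell)


def kerf_estimate(in_cell, y):
    """KeRF: pool all (tree, point) pairs before dividing."""
    pooled = sum(yi for row in in_cell for yi, b in zip(y, row) if b)
    total = sum(sum(row) for row in in_cell)
    return pooled / total if total else 0.0


def kernel_form(in_cell, y):
    """Kernel form: K_{M,n}(x, X_i) is the fraction of trees connecting x and X_i."""
    n_trees = len(in_cell)
    kernel = [sum(row[i] for row in in_cell) / n_trees for i in range(len(y))]
    denom = sum(kernel)
    return sum(yi * ki for yi, ki in zip(y, kernel)) / denom if denom else 0.0


# Toy data: 3 trees, 4 observations (hypothetical values).
in_cell = [[1, 1, 0, 0],
           [1, 0, 0, 0],
           [1, 1, 1, 0]]
y = [1.0, 2.0, 4.0, 8.0]
print(forest_estimate(in_cell, y), kerf_estimate(in_cell, y), kernel_form(in_cell, y))
```

On this example the KeRF estimate and its kernel form coincide (both equal 11/6), while the classical forest estimate differs (29/18) because trees with few points in the cell of $\mathbf{x}$ give more weight to each of their observations.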

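Proof of Proposition 2. We prove the result for d = 2. The other cases can
be treated similarly. For the moment, we assume the random forest to be
continuous. Recall that, for all $\mathbf{x}, \mathbf{z} \in [0,1]^2$, and for all $M \in \mathbb{N}$,

\[
K_{M,n}(\mathbf{x}, \mathbf{z}) = \frac{1}{M} \sum_{j=1}^{M} \mathbb{1}_{\mathbf{z} \in A_n(\mathbf{x}, \Theta_j)}.
\]

According to the strong law of large numbers, almost surely, for all
$\mathbf{x}, \mathbf{z} \in \mathbb{Q}^2 \cap [0,1]^2$,

\[
\lim_{M \to \infty} K_{M,n}(\mathbf{x}, \mathbf{z}) = K_n(\mathbf{x}, \mathbf{z}).
\]

Set $\varepsilon > 0$ and $\mathbf{x}, \mathbf{z} \in [0,1]^2$ where $\mathbf{x} = (x^{(1)}, x^{(2)})$ and
$\mathbf{z} = (z^{(1)}, z^{(2)})$. Assume, without loss of generality, that $x^{(1)} < z^{(1)}$ and
$x^{(2)} < z^{(2)}$. Let

\[
A_{\mathbf{x}} = \{\mathbf{u} \in [0,1]^2 : u^{(1)} \le x^{(1)} \text{ and } u^{(2)} \le x^{(2)}\},
\quad
A_{\mathbf{z}} = \{\mathbf{u} \in [0,1]^2 : u^{(1)} \ge z^{(1)} \text{ and } u^{(2)} \ge z^{(2)}\}.
\]

Choose $\mathbf{x}_1 \in A_{\mathbf{x}} \cap \mathbb{Q}^2$ (resp. $\mathbf{z}_2 \in A_{\mathbf{z}} \cap \mathbb{Q}^2$) and take
$\mathbf{x}_2 \in [0,1]^2 \cap \mathbb{Q}^2$ (resp. $\mathbf{z}_1 \in [0,1]^2 \cap \mathbb{Q}^2$) such that
$x_1^{(1)} \le x^{(1)} \le x_2^{(1)}$ and $x_1^{(2)} \le x^{(2)} \le x_2^{(2)}$ (resp.
$z_1^{(1)} \le z^{(1)} \le z_2^{(1)}$ and $z_1^{(2)} \le z^{(2)} \le z_2^{(2)}$, see Figure 10).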



Figure 10: Respective positions of $\mathbf{x}, \mathbf{x}_1, \mathbf{x}_2$ and $\mathbf{z}, \mathbf{z}_1, \mathbf{z}_2$.


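Observe that, because of the continuity of $K_n$, one can choose $\mathbf{x}_1, \mathbf{x}_2$ close
enough to $\mathbf{x}$ and $\mathbf{z}_2, \mathbf{z}_1$ close enough to $\mathbf{z}$ such that

\[
|K_n(\mathbf{x}_2, \mathbf{x}_1) - 1| \le \varepsilon, \qquad
|K_n(\mathbf{z}_1, \mathbf{z}_2) - 1| \le \varepsilon, \qquad
\text{and} \quad |K_n(\mathbf{x}_1, \mathbf{z}_2) - K_n(\mathbf{x}, \mathbf{z})| \le \varepsilon.
\]

Bounding the difference between $K_{M,n}$ and $K_n$, we have

\begin{align}
|K_{M,n}(\mathbf{x}, \mathbf{z}) - K_n(\mathbf{x}, \mathbf{z})|
&\le |K_{M,n}(\mathbf{x}, \mathbf{z}) - K_{M,n}(\mathbf{x}_1, \mathbf{z}_2)| \nonumber \\
&\quad + |K_{M,n}(\mathbf{x}_1, \mathbf{z}_2) - K_n(\mathbf{x}_1, \mathbf{z}_2)| \nonumber \\
&\quad + |K_n(\mathbf{x}_1, \mathbf{z}_2) - K_n(\mathbf{x}, \mathbf{z})|. \tag{8}
\end{align}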

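To simplify notation, we let $\mathbf{x} \overset{\Theta_m}{\leftrightarrow} \mathbf{z}$ be the event where $\mathbf{x}$ and $\mathbf{z}$ are in the same
cell in the tree built with randomness $\Theta_m$ and dataset $\mathcal{D}_n$. We also let
$\mathbf{x} \overset{\Theta_m}{\nleftrightarrow} \mathbf{z}$ be the complement event of $\mathbf{x} \overset{\Theta_m}{\leftrightarrow} \mathbf{z}$. Accordingly, the first term
on the right side in equation (8) is bounded above by

\begin{align}
|K_{M,n}(\mathbf{x}, \mathbf{z}) - K_{M,n}(\mathbf{x}_1, \mathbf{z}_2)|
&\le \frac{1}{M} \sum_{m=1}^{M} \Big| \mathbb{1}_{\mathbf{x} \overset{\Theta_m}{\leftrightarrow} \mathbf{z}} - \mathbb{1}_{\mathbf{x}_1 \overset{\Theta_m}{\leftrightarrow} \mathbf{z}_2} \Big| \nonumber \\
&\le \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}_{\mathbf{x}_1 \overset{\Theta_m}{\nleftrightarrow} \mathbf{x}}
   + \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}_{\mathbf{z}_2 \overset{\Theta_m}{\nleftrightarrow} \mathbf{z}}
   \quad \text{(given the positions of } \mathbf{x}, \mathbf{x}_1, \mathbf{z}, \mathbf{z}_2 \text{)} \nonumber \\
&\le \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}_{\mathbf{x}_1 \overset{\Theta_m}{\nleftrightarrow} \mathbf{x}_2}
   + \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}_{\mathbf{z}_2 \overset{\Theta_m}{\nleftrightarrow} \mathbf{z}_1}, \tag{9}
\end{align}

given the respective positions of $\mathbf{x}, \mathbf{x}_1, \mathbf{x}_2$ and $\mathbf{z}, \mathbf{z}_1, \mathbf{z}_2$. But, since
$\mathbf{x}_2, \mathbf{z}_1, \mathbf{x}_1, \mathbf{z}_2 \in \mathbb{Q}^2 \cap [0,1]^2$, we deduce from inequality (9) that, for all M large enough,

\[
|K_{M,n}(\mathbf{x}, \mathbf{z}) - K_{M,n}(\mathbf{x}_1, \mathbf{z}_2)|
\le 1 - K_n(\mathbf{x}_2, \mathbf{x}_1) + 1 - K_n(\mathbf{z}_1, \mathbf{z}_2) + 2\varepsilon.
\]

Combining the last inequality with equation (8), we obtain, for all M large
enough,

\begin{align*}
|K_{M,n}(\mathbf{x}, \mathbf{z}) - K_n(\mathbf{x}, \mathbf{z})|
&\le 1 - K_n(\mathbf{x}_2, \mathbf{x}_1) + 1 - K_n(\mathbf{z}_1, \mathbf{z}_2) \\
&\quad + |K_{M,n}(\mathbf{x}_1, \mathbf{z}_2) - K_n(\mathbf{x}_1, \mathbf{z}_2)| \\
&\quad + |K_n(\mathbf{x}_1, \mathbf{z}_2) - K_n(\mathbf{x}, \mathbf{z})| + 2\varepsilon \\
&\le 6\varepsilon.
\end{align*}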

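Consequently, for any continuous random forest, almost surely, for all
$\mathbf{x}, \mathbf{z} \in [0,1]^2$,

\[
\lim_{M \to \infty} K_{M,n}(\mathbf{x}, \mathbf{z}) = K_n(\mathbf{x}, \mathbf{z}).
\]

The proof can be easily adapted to the case of discrete random forests. This
completes the first part of the proof. Next, observe that

\[
\lim_{M \to \infty} \frac{\sum_{i=1}^{n} Y_i K_{M,n}(\mathbf{x}, \mathbf{X}_i)}{\sum_{j=1}^{n} K_{M,n}(\mathbf{x}, \mathbf{X}_j)}
= \frac{\sum_{i=1}^{n} Y_i K_n(\mathbf{x}, \mathbf{X}_i)}{\sum_{j=1}^{n} K_n(\mathbf{x}, \mathbf{X}_j)},
\]

for all $\mathbf{x}$ satisfying $\sum_{j=1}^{n} K_n(\mathbf{x}, \mathbf{X}_j) \ne 0$. Thus, almost surely for those $\mathbf{x}$,

\[
\lim_{M \to \infty} \tilde{m}_{M,n}(\mathbf{x}) = \tilde{m}_{\infty,n}(\mathbf{x}). \tag{10}
\]

Now, if there exists any $\mathbf{x}$ such that $\sum_{j=1}^{n} K_n(\mathbf{x}, \mathbf{X}_j) = 0$, then $\mathbf{x}$ is not
connected with any data points in any tree of the forest. In that case,
$\sum_{j=1}^{n} K_{M,n}(\mathbf{x}, \mathbf{X}_j) = 0$ and, by convention, $\tilde{m}_{\infty,n}(\mathbf{x}) = \tilde{m}_{M,n}(\mathbf{x}) = 0$. Finally,
formula (10) holds for all $\mathbf{x} \in [0,1]^2$. □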
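The convergence of the finite kernel to its limit can be observed by simulation. The sketch below is an illustration under stated assumptions: it uses centred trees of level k (each cut picks a coordinate uniformly at random and splits at the midpoint), for which the limit kernel has the closed multinomial form of Proposition 5. The function names, the points `x` and `z`, the level and the number of trees are all hypothetical choices.

```python
import math
import random
from itertools import product


def same_cell(x, z, cuts):
    """True if x and z share a cell after cuts[j] midpoint splits along coordinate j."""
    return all(math.ceil(2**kj * xj) == math.ceil(2**kj * zj)
               for xj, zj, kj in zip(x, z, cuts))


def finite_kernel(x, z, k, n_trees, rng):
    """Monte Carlo estimate: fraction of simulated centred trees connecting x and z."""
    d = len(x)
    hits = 0
    for _ in range(n_trees):
        cuts = [0] * d
        for _ in range(k):                  # k cuts along the path of x
            cuts[rng.randrange(d)] += 1     # coordinate chosen uniformly
        hits += same_cell(x, z, cuts)
    return hits / n_trees


def infinite_kernel(x, z, k):
    """Exact limit: multinomial average of the same-cell indicator."""
    d = len(x)
    total = 0.0
    for cuts in product(range(k + 1), repeat=d):
        if sum(cuts) != k:
            continue
        weight = math.factorial(k) / math.prod(math.factorial(c) for c in cuts) / d**k
        total += weight * same_cell(x, z, cuts)
    return total


x, z, k = (0.30, 0.60), (0.35, 0.70), 3
exact = infinite_kernel(x, z, k)
approx = finite_kernel(x, z, k, n_trees=20000, rng=random.Random(0))
print(exact, approx)
```

For these two points the only configuration that separates them is three cuts along the second coordinate, so the exact value is 1 - 1/8 = 0.875, and the Monte Carlo average gets close to it as the number of trees grows.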
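Proof of Proposition 3. Fix $\mathbf{x} \in [0,1]^d$ and assume that, a.s., $Y \ge 0$. By
assumption (H1.1), there exist sequences $(a_n)$, $(b_n)$ such that, almost surely,

\[
a_n \le N_n(\mathbf{x}, \Theta) \le b_n.
\]

To simplify notation, we let $N_{M,n}(\mathbf{x}) = \frac{1}{M} \sum_{j=1}^{M} N_n(\mathbf{x}, \Theta_j)$. Thus, almost
surely,

\begin{align*}
|m_{M,n}(\mathbf{x}) - \tilde{m}_{M,n}(\mathbf{x})|
&= \left| \sum_{i=1}^{n} Y_i \left( \frac{1}{M} \sum_{m=1}^{M} \frac{\mathbb{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_m)}}{N_n(\mathbf{x}, \Theta_m)}
   - \frac{1}{M} \sum_{m=1}^{M} \frac{\mathbb{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_m)}}{N_{M,n}(\mathbf{x})} \right) \right| \\
&\le \frac{1}{M} \sum_{i=1}^{n} Y_i \sum_{m=1}^{M} \frac{\mathbb{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_m)}}{N_{M,n}(\mathbf{x})}
   \times \left| \frac{N_{M,n}(\mathbf{x})}{N_n(\mathbf{x}, \Theta_m)} - 1 \right| \\
&\le \frac{b_n - a_n}{a_n} \, \tilde{m}_{M,n}(\mathbf{x}).
\end{align*}

□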


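Proof of Proposition 4. Fix $\mathbf{x} \in [0,1]^d$ and assume that, almost surely, $Y \ge 0$.
By assumption (H1.2), there exist sequences $(a_n)$, $(b_n)$, $(\varepsilon_n)$ such that, letting
$\mathcal{A}$ be the event where

\[
a_n \le N_n(\mathbf{x}, \Theta) \le b_n,
\]

we have, almost surely,

\[
\mathbb{P}_\Theta[\mathcal{A}] \ge 1 - \varepsilon_n \quad \text{and} \quad 1 \le a_n \le \mathbb{E}_\Theta[N_n(\mathbf{x}, \Theta)] \le b_n.
\]

Therefore, a.s.,

\begin{align*}
|m_{\infty,n}(\mathbf{x}) - \tilde{m}_{\infty,n}(\mathbf{x})|
&= \left| \sum_{i=1}^{n} Y_i \, \mathbb{E}_\Theta\!\left[ \frac{\mathbb{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta)}}{N_n(\mathbf{x}, \Theta)} \right]
   - \sum_{i=1}^{n} Y_i \, \frac{\mathbb{E}_\Theta\big[\mathbb{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta)}\big]}{\mathbb{E}_\Theta[N_n(\mathbf{x}, \Theta)]} \right| \\
&= \left| \sum_{i=1}^{n} Y_i \, \mathbb{E}_\Theta\!\left[ \frac{\mathbb{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta)}}{\mathbb{E}_\Theta[N_n(\mathbf{x}, \Theta)]}
   \left( \frac{\mathbb{E}_\Theta[N_n(\mathbf{x}, \Theta)]}{N_n(\mathbf{x}, \Theta)} - 1 \right) (\mathbb{1}_{\mathcal{A}} + \mathbb{1}_{\mathcal{A}^c}) \right] \right| \\
&\le \frac{b_n - a_n}{a_n} \, \tilde{m}_{\infty,n}(\mathbf{x})
   + \Big( \max_{1 \le i \le n} Y_i \Big) \mathbb{E}_\Theta\!\left[ \left| 1 - \frac{N_n(\mathbf{x}, \Theta)}{\mathbb{E}_\Theta[N_n(\mathbf{x}, \Theta)]} \right| \mathbb{1}_{\mathcal{A}^c} \right] \\
&\le \frac{b_n - a_n}{a_n} \, \tilde{m}_{\infty,n}(\mathbf{x})
   + n \Big( \max_{1 \le i \le n} Y_i \Big) \mathbb{P}_\Theta[\mathcal{A}^c].
\end{align*}

Consequently, almost surely,

\[
|m_{\infty,n}(\mathbf{x}) - \tilde{m}_{\infty,n}(\mathbf{x})|
\le \frac{b_n - a_n}{a_n} \, \tilde{m}_{\infty,n}(\mathbf{x}) + n \varepsilon_n \Big( \max_{1 \le i \le n} Y_i \Big).
\]

□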
Proof of Proposition 5. Assume for the moment that d = 1. Take $x, z \in [0,1]$
and assume, without loss of generality, that $x \le z$. Then the probability that x
and z are in the same cell, after k cuts, is equal to

\[
\mathbb{1}_{\lceil 2^k x \rceil = \lceil 2^k z \rceil}.
\]

To prove the result in the multivariate case, take $\mathbf{x}, \mathbf{z} \in [0,1]^d$. Since cuts are
independent, the probability that $\mathbf{x}$ and $\mathbf{z}$ are in the same cell after k cuts is
given by the following multinomial

\[
K_k^{cc}(\mathbf{x}, \mathbf{z}) = \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}}
\frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^{k}
\prod_{j=1}^{d} \mathbb{1}_{\lceil 2^{k_j} x_j \rceil = \lceil 2^{k_j} z_j \rceil}.
\]

□
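The multinomial expression above translates directly into code. The following sketch is one possible implementation, written for readability rather than speed; the function names and the toy data are illustrative assumptions, and the convention that the estimate is 0 when no data point is connected to the query follows the one used in the proof of Proposition 2.

```python
import math


def compositions(k, d):
    """Yield all tuples (k_1, ..., k_d) of non-negative integers summing to k."""
    if d == 1:
        yield (k,)
        return
    for first in range(k + 1):
        for rest in compositions(k - first, d - 1):
            yield (first,) + rest


def centred_kernel(x, z, k):
    """Probability that x and z share a cell of a centred tree of level k."""
    d = len(x)
    total = 0.0
    for ks in compositions(k, d):
        connected = all(math.ceil(2**kj * xj) == math.ceil(2**kj * zj)
                        for kj, xj, zj in zip(ks, x, z))
        if connected:
            coef = math.factorial(k)
            for kj in ks:                   # multinomial coefficient k! / (k_1! ... k_d!)
                coef //= math.factorial(kj)
            total += coef / d**k
    return total


def centred_kerf(x, X, Y, k):
    """Infinite centred KeRF estimate at x: kernel-weighted average of the responses."""
    weights = [centred_kernel(x, xi, k) for xi in X]
    s = sum(weights)
    return sum(w * y for w, y in zip(weights, Y)) / s if s else 0.0


# Hypothetical toy data in dimension 2.
X = [(0.30, 0.60), (0.90, 0.10)]
Y = [1.0, 5.0]
print(centred_kernel((0.30, 0.60), (0.35, 0.70), 3))
print(centred_kerf((0.30, 0.60), X, Y, 2))
```

In the second call the query coincides with the first data point and is separated from the second one by any single cut, so the estimate reduces to the first response.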


To prove Theorem 4.1, we need to control the bias of the centred KeRF estimate, 
which is done in Theorem 6.1. 


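Theorem 6.1. Let f be an L-Lipschitz function. Then, for all k,

\[
\sup_{\mathbf{x} \in [0,1]^d}
\left| \frac{\int_{[0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) f(\mathbf{z}) \, dz_1 \cdots dz_d}{\int_{[0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) \, dz_1 \cdots dz_d} - f(\mathbf{x}) \right|
\le L d \left( 1 - \frac{1}{2d} \right)^{k}.
\]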


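Proof of Theorem 6.1. Let $\mathbf{x} \in [0,1]^d$ and $k \in \mathbb{N}$. Take f an L-Lipschitz function.
In the rest of the proof, for clarity, we use the notation $d\mathbf{z}$ instead of
$dz_1 \cdots dz_d$. Thus,

\[
\left| \frac{\int_{[0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) f(\mathbf{z}) \, d\mathbf{z}}{\int_{[0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}} - f(\mathbf{x}) \right|
\le \frac{\int_{[0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) |f(\mathbf{z}) - f(\mathbf{x})| \, d\mathbf{z}}{\int_{[0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}}.
\]

Note that,

\begin{align}
\int_{[0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) |f(\mathbf{z}) - f(\mathbf{x})| \, d\mathbf{z}
&\le L \sum_{\ell=1}^{d} \int_{[0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) |z_\ell - x_\ell| \, d\mathbf{z} \nonumber \\
&\le L \sum_{\ell=1}^{d} \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}}
   \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^{k}
   \prod_{m \ne \ell} \int_{0}^{1} K_{k_m}^{cc}(x_m, z_m) \, dz_m \nonumber \\
&\qquad \times \int_{0}^{1} K_{k_\ell}^{cc}(x_\ell, z_\ell) |z_\ell - x_\ell| \, dz_\ell. \tag{11}
\end{align}

The last integral is upper bounded by

\begin{align}
\int_{[0,1]} K_{k_\ell}^{cc}(x_\ell, z_\ell) |x_\ell - z_\ell| \, dz_\ell
&= \int_{[0,1]} \mathbb{1}_{\lceil 2^{k_\ell} x_\ell \rceil = \lceil 2^{k_\ell} z_\ell \rceil} |x_\ell - z_\ell| \, dz_\ell \nonumber \\
&\le \left( \frac{1}{2} \right)^{k_\ell} \int_{[0,1]} \mathbb{1}_{\lceil 2^{k_\ell} x_\ell \rceil = \lceil 2^{k_\ell} z_\ell \rceil} \, dz_\ell \nonumber \\
&\le \left( \frac{1}{2} \right)^{k_\ell} \int_{[0,1]} K_{k_\ell}^{cc}(x_\ell, z_\ell) \, dz_\ell. \tag{12}
\end{align}

Therefore, combining inequalities (11) and (12), we obtain

\begin{align}
\int_{[0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) |f(\mathbf{z}) - f(\mathbf{x})| \, d\mathbf{z}
&\le L \sum_{\ell=1}^{d} \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}}
   \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{2} \right)^{k_\ell} \left( \frac{1}{d} \right)^{k}
   \prod_{m=1}^{d} \int_{0}^{1} K_{k_m}^{cc}(x_m, z_m) \, dz_m \nonumber \\
&\le L \left( \frac{1}{2} \right)^{k} \sum_{\ell=1}^{d} \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}}
   \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{2} \right)^{k_\ell} \left( \frac{1}{d} \right)^{k}, \tag{13}
\end{align}

since simple calculations show that, for all $x_m \in [0,1]$ and for all $k_m \in \mathbb{N}$,

\[
\int_{0}^{1} K_{k_m}^{cc}(x_m, z_m) \, dz_m
= \int_{[0,1]} \mathbb{1}_{\lceil 2^{k_m} x_m \rceil = \lceil 2^{k_m} z_m \rceil} \, dz_m
= \left( \frac{1}{2} \right)^{k_m}. \tag{14}
\]

Consequently, we get from inequality (13) that

\[
\frac{\int_{[0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) |f(\mathbf{z}) - f(\mathbf{x})| \, d\mathbf{z}}{\int_{[0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}}
\le L \sum_{\ell=1}^{d} \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}}
\frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{2} \right)^{k_\ell} \left( \frac{1}{d} \right)^{k}.
\]

Taking the first term of the sum, we obtain

\begin{align*}
\sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}}
\frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{2} \right)^{k_1} \left( \frac{1}{d} \right)^{k}
&= \sum_{k_1 = 0}^{k} \left( \frac{1}{2d} \right)^{k_1} \left( 1 - \frac{1}{d} \right)^{k - k_1} \frac{k!}{k_1! (k - k_1)!} \\
&\le \left( 1 - \frac{1}{2d} \right)^{k}.
\end{align*}

Finally,

\[
\frac{\int_{[0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) |f(\mathbf{z}) - f(\mathbf{x})| \, d\mathbf{z}}{\int_{[0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}}
\le L d \left( 1 - \frac{1}{2d} \right)^{k}.
\]

□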
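The last step of the proof rests on a binomial sum, which can be checked numerically. The sketch below is only a sanity check under illustrative choices of k and d; the function names are assumptions. It also evaluates the resulting bias bound $L d (1 - 1/(2d))^k$.

```python
from math import comb


def first_term_sum(k: int, d: int) -> float:
    """Sum over k_1 of C(k, k_1) (1/(2d))^{k_1} (1 - 1/d)^{k - k_1}."""
    return sum(comb(k, k1) * (1 / (2 * d)) ** k1 * (1 - 1 / d) ** (k - k1)
               for k1 in range(k + 1))


def bias_bound(lipschitz: float, d: int, k: int) -> float:
    """Upper bound L d (1 - 1/(2d))^k of Theorem 6.1."""
    return lipschitz * d * (1 - 1 / (2 * d)) ** k


# By the binomial theorem the sum equals (1 - 1/(2d))^k, so the bound holds with equality.
for k, d in [(3, 1), (6, 3), (10, 10)]:
    print(k, d, first_term_sum(k, d), (1 - 1 / (2 * d)) ** k)

# The bound decreases geometrically in k, but slowly when d is large.
print(bias_bound(1.0, 2, 10), bias_bound(1.0, 10, 10))
```

The slow decay of $(1 - 1/(2d))^k$ for large d is what drives the dimension-dependent exponent in the rate of Theorem 4.1.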
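Proof of Theorem 4.1. Let $\mathbf{x} \in [0,1]^d$, $\|m\|_\infty = \sup_{\mathbf{x} \in [0,1]^d} |m(\mathbf{x})|$ and recall that

\[
\tilde{m}_{\infty,n}^{cc}(\mathbf{x}) = \frac{\sum_{i=1}^{n} Y_i K_k^{cc}(\mathbf{x}, \mathbf{X}_i)}{\sum_{i=1}^{n} K_k^{cc}(\mathbf{x}, \mathbf{X}_i)}.
\]

Thus, letting

\begin{align*}
A_n(\mathbf{x}) &= \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i K_k^{cc}(\mathbf{x}, \mathbf{X}_i)}{\mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})]} - \frac{\mathbb{E}[Y K_k^{cc}(\mathbf{x}, \mathbf{X})]}{\mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})]} \right), \\
B_n(\mathbf{x}) &= \frac{1}{n} \sum_{i=1}^{n} \left( \frac{K_k^{cc}(\mathbf{x}, \mathbf{X}_i)}{\mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})]} - 1 \right), \\
\text{and} \quad M_n(\mathbf{x}) &= \frac{\mathbb{E}[Y K_k^{cc}(\mathbf{x}, \mathbf{X})]}{\mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})]},
\end{align*}

the estimate $\tilde{m}_{\infty,n}^{cc}(\mathbf{x})$ can be rewritten as

\[
\tilde{m}_{\infty,n}^{cc}(\mathbf{x}) = \frac{M_n(\mathbf{x}) + A_n(\mathbf{x})}{1 + B_n(\mathbf{x})},
\]

which leads to

\[
\tilde{m}_{\infty,n}^{cc}(\mathbf{x}) - m(\mathbf{x}) = \frac{M_n(\mathbf{x}) - m(\mathbf{x}) + A_n(\mathbf{x}) - B_n(\mathbf{x}) m(\mathbf{x})}{1 + B_n(\mathbf{x})}.
\]

According to Theorem 6.1, we have

\begin{align*}
|M_n(\mathbf{x}) - m(\mathbf{x})|
&= \left| \frac{\mathbb{E}[m(\mathbf{X}) K_k^{cc}(\mathbf{x}, \mathbf{X})]}{\mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})]} + \frac{\mathbb{E}[\varepsilon K_k^{cc}(\mathbf{x}, \mathbf{X})]}{\mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})]} - m(\mathbf{x}) \right| \\
&\le \left| \frac{\mathbb{E}[m(\mathbf{X}) K_k^{cc}(\mathbf{x}, \mathbf{X})]}{\mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})]} - m(\mathbf{x}) \right| \\
&\le C_1 \left( 1 - \frac{1}{2d} \right)^{k},
\end{align*}

where $C_1 = Ld$. Take $\alpha \in \, ]0, 1/2]$. Let $\mathcal{C}_\alpha(\mathbf{x})$ be the event on which
$\{|A_n(\mathbf{x})|, |B_n(\mathbf{x})| \le \alpha\}$. On the event $\mathcal{C}_\alpha(\mathbf{x})$, we have

\begin{align*}
|\tilde{m}_{\infty,n}^{cc}(\mathbf{x}) - m(\mathbf{x})|^2
&\le 8 |M_n(\mathbf{x}) - m(\mathbf{x})|^2 + 8 |A_n(\mathbf{x}) - B_n(\mathbf{x}) m(\mathbf{x})|^2 \\
&\le 8 C_1^2 \left( 1 - \frac{1}{2d} \right)^{2k} + 8 \alpha^2 (1 + \|m\|_\infty)^2.
\end{align*}

Thus,

\[
\mathbb{E}\big[ |\tilde{m}_{\infty,n}^{cc}(\mathbf{x}) - m(\mathbf{x})|^2 \mathbb{1}_{\mathcal{C}_\alpha(\mathbf{x})} \big]
\le 8 C_1^2 \left( 1 - \frac{1}{2d} \right)^{2k} + 8 \alpha^2 (1 + \|m\|_\infty)^2. \tag{15}
\]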
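Consequently, to find an upper bound on the rate of consistency of $\tilde{m}_{\infty,n}^{cc}$, we
just need to upper bound

\begin{align*}
\mathbb{E}\big[ |\tilde{m}_{\infty,n}^{cc}(\mathbf{x}) - m(\mathbf{x})|^2 \mathbb{1}_{\mathcal{C}_\alpha^c(\mathbf{x})} \big]
&\le \mathbb{E}\Big[ \big| \max_{1 \le i \le n} Y_i + m(\mathbf{x}) \big|^2 \mathbb{1}_{\mathcal{C}_\alpha^c(\mathbf{x})} \Big]
   \quad \text{(since } \tilde{m}_{\infty,n}^{cc} \text{ is a local averaging estimate)} \\
&\le \mathbb{E}\Big[ \big| 2\|m\|_\infty + \max_{1 \le i \le n} \varepsilon_i \big|^2 \mathbb{1}_{\mathcal{C}_\alpha^c(\mathbf{x})} \Big] \\
&\le \left( \mathbb{E}\Big[ \big( 2\|m\|_\infty + \max_{1 \le i \le n} \varepsilon_i \big)^4 \Big] \, \mathbb{P}[\mathcal{C}_\alpha^c(\mathbf{x})] \right)^{1/2}
   \quad \text{(by Cauchy-Schwarz inequality)} \\
&\le \left( \Big( 16 \|m\|_\infty^4 + 8 \, \mathbb{E}\Big[ \max_{1 \le i \le n} \varepsilon_i^4 \Big] \Big) \mathbb{P}[\mathcal{C}_\alpha^c(\mathbf{x})] \right)^{1/2}.
\end{align*}

Simple calculations on Gaussian tails show that one can find a constant $C > 0$
such that for all n,

\[
\mathbb{E}\Big[ \max_{1 \le i \le n} \varepsilon_i^4 \Big] \le C (\log n)^2.
\]

Thus, there exists $C_2$ such that, for all $n > 1$,

\[
\mathbb{E}\big[ |\tilde{m}_{\infty,n}^{cc}(\mathbf{x}) - m(\mathbf{x})|^2 \mathbb{1}_{\mathcal{C}_\alpha^c(\mathbf{x})} \big]
\le C_2 (\log n) \big( \mathbb{P}[\mathcal{C}_\alpha^c(\mathbf{x})] \big)^{1/2}. \tag{16}
\]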


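The last probability $\mathbb{P}[\mathcal{C}_\alpha^c(\mathbf{x})]$ can be upper bounded by using Chebyshev's
inequality. Indeed, with respect to $A_n(\mathbf{x})$,

\begin{align*}
\mathbb{P}[|A_n(\mathbf{x})| > \alpha]
&\le \frac{1}{n \alpha^2} \, \mathbb{E}\left[ \left( \frac{Y K_k^{cc}(\mathbf{x}, \mathbf{X})}{\mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})]} - \frac{\mathbb{E}[Y K_k^{cc}(\mathbf{x}, \mathbf{X})]}{\mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})]} \right)^{2} \right] \\
&\le \frac{1}{n \alpha^2 (\mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})])^2} \, \mathbb{E}\big[ Y^2 K_k^{cc}(\mathbf{x}, \mathbf{X})^2 \big] \\
&\le \frac{2}{n \alpha^2 (\mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})])^2}
   \Big( \mathbb{E}\big[ m(\mathbf{X})^2 K_k^{cc}(\mathbf{x}, \mathbf{X})^2 \big] + \mathbb{E}\big[ \varepsilon^2 K_k^{cc}(\mathbf{x}, \mathbf{X})^2 \big] \Big) \\
&\le \frac{2 (\|m\|_\infty^2 + \sigma^2) \, \mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})]}{n \alpha^2 (\mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})])^2}
   \quad \Big( \text{since } \sup_{\mathbf{x}, \mathbf{z} \in [0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) \le 1 \Big) \\
&\le \frac{2 M_1^2 \, 2^k}{n \alpha^2}
   \quad \text{(according to inequality (14))},
\end{align*}

where $M_1^2 = \|m\|_\infty^2 + \sigma^2$. Meanwhile with respect to $B_n(\mathbf{x})$, we obtain, still by
Chebyshev's inequality,

\begin{align*}
\mathbb{P}[|B_n(\mathbf{x})| > \alpha]
&\le \frac{1}{n \alpha^2} \, \mathbb{E}\left[ \left( \frac{K_k^{cc}(\mathbf{x}, \mathbf{X}_i)}{\mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})]} \right)^{2} \right] \\
&\le \frac{1}{n \alpha^2 \, \mathbb{E}[K_k^{cc}(\mathbf{x}, \mathbf{X})]}
   \quad \Big( \text{since } \sup_{\mathbf{x}, \mathbf{z} \in [0,1]^d} K_k^{cc}(\mathbf{x}, \mathbf{z}) \le 1 \Big) \\
&\le \frac{2^k}{n \alpha^2}.
\end{align*}

Thus, the probability of $\mathcal{C}_\alpha(\mathbf{x})$ is bounded from below by

\begin{align*}
\mathbb{P}[\mathcal{C}_\alpha(\mathbf{x})]
&\ge 1 - \mathbb{P}(|A_n(\mathbf{x})| \ge \alpha) - \mathbb{P}(|B_n(\mathbf{x})| \ge \alpha) \\
&\ge 1 - \frac{2^k}{n \alpha^2} - \frac{2 M_1^2 \, 2^k}{n \alpha^2} \\
&\ge 1 - \frac{2^k (2 M_1^2 + 1)}{n \alpha^2}.
\end{align*}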

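Consequently, according to inequality (16), we obtain

\[
\mathbb{E}\big[ |\tilde{m}_{\infty,n}^{cc}(\mathbf{x}) - m(\mathbf{x})|^2 \mathbb{1}_{\mathcal{C}_\alpha^c(\mathbf{x})} \big]
\le C_2 (\log n) \left( \frac{2^k (2 M_1^2 + 1)}{n \alpha^2} \right)^{1/2}.
\]

Then using inequality (15),

\begin{align*}
\mathbb{E}\big[ \tilde{m}_{\infty,n}^{cc}(\mathbf{x}) - m(\mathbf{x}) \big]^2
&\le \mathbb{E}\big[ |\tilde{m}_{\infty,n}^{cc}(\mathbf{x}) - m(\mathbf{x})|^2 \mathbb{1}_{\mathcal{C}_\alpha(\mathbf{x})} \big]
   + \mathbb{E}\big[ |\tilde{m}_{\infty,n}^{cc}(\mathbf{x}) - m(\mathbf{x})|^2 \mathbb{1}_{\mathcal{C}_\alpha^c(\mathbf{x})} \big] \\
&\le 8 C_1^2 \left( 1 - \frac{1}{2d} \right)^{2k} + 8 \alpha^2 (1 + \|m\|_\infty)^2
   + C_2 (\log n) \left( \frac{2^k (2 M_1^2 + 1)}{n \alpha^2} \right)^{1/2}.
\end{align*}

Optimizing the right hand side in $\alpha$, we get

\[
\mathbb{E}\big[ \tilde{m}_{\infty,n}^{cc}(\mathbf{x}) - m(\mathbf{x}) \big]^2
\le 8 C_1^2 \left( 1 - \frac{1}{2d} \right)^{2k} + C_3 \left( \frac{(\log n)^2 \, 2^k}{n} \right)^{1/3},
\]

for some constant $C_3 > 0$. The last expression is minimized for

\[
k = C_4 + \frac{1}{\log 2 + \frac{3}{d}} \log\left( \frac{n}{(\log n)^2} \right),
\]

where $C_4$ is a constant that does not depend on n. Consequently, there exists a constant
$C_5$ such that, for all $n > 1$,

\[
\mathbb{E}\big[ \tilde{m}_{\infty,n}^{cc}(\mathbf{x}) - m(\mathbf{x}) \big]^2 \le C_5 \, n^{-1/(d \log 2 + 3)} (\log n)^2.
\]

□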
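The choice of k balances the bias term against the variance term. The sketch below illustrates this balance numerically under simplifying assumptions: constants and logarithmic factors are dropped, only the leading term $k \approx \log n / (\log 2 + 3/d)$ is used, and k is allowed to be non-integer. The function name and the values of n and d are illustrative.

```python
import math


def centred_rate_terms(n: int, d: int):
    """Bias and variance terms at the leading-order level k, constants dropped."""
    k = math.log(n) / (math.log(2) + 3 / d)        # leading term of the optimal level
    bias_sq = (1 - 1 / (2 * d)) ** (2 * k)         # squared-bias term
    variance = (2 ** k / n) ** (1 / 3)             # variance term
    target = n ** (-1 / (d * math.log(2) + 3))     # announced rate, log factor dropped
    return bias_sq, variance, target


for n, d in [(10**3, 2), (10**5, 5), (10**6, 10)]:
    bias_sq, variance, target = centred_rate_terms(n, d)
    print(n, d, bias_sq, variance, target)
```

With this level the variance term equals $n^{-1/(d \log 2 + 3)}$ exactly, and the squared-bias term is at most of the same order because $(1 - 1/(2d))^{2k} \le e^{-k/d}$.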
Proof of Lemma 1. Let $x, z \in [0,1]$ such that $x < z$. The first statement comes
from the fact that splits are drawn uniformly over [0,1]. To address the second
one, denote by $Z_1$ (resp. $Z_2$) the position of the first (resp. second) split used
to build the cell containing $[x, z]$. Observe that, given $Z_1 = z_1$, $Z_2$ is uniformly
distributed over $[z_1, 1]$ (resp. $[0, z_1]$) if $z_1 \le x$ (resp. $z_1 \ge z$). Thus, we have

\begin{align*}
K_2^{uf}(x, z)
&= \int_{z_1 = 0}^{x} \left( \int_{z_2 = z_1}^{x} \frac{1}{1 - z_1} \, dz_2 + \int_{z_2 = z}^{1} \frac{1}{1 - z_1} \, dz_2 \right) dz_1 \\
&\quad + \int_{z_1 = z}^{1} \left( \int_{z_2 = 0}^{x} \frac{1}{z_1} \, dz_2 + \int_{z_2 = z}^{z_1} \frac{1}{z_1} \, dz_2 \right) dz_1.
\end{align*}

The first term takes the form

\[
\int_{0}^{x} \frac{1}{1 - z_1} \left( \int_{z_1}^{x} dz_2 \right) dz_1
= \int_{0}^{x} \frac{x - z_1}{1 - z_1} \, dz_1
= x - (1 - x) \log(1 - x).
\]

Similarly, one has

\begin{align*}
\int_{0}^{x} \int_{z}^{1} \frac{1}{1 - z_1} \, dz_2 \, dz_1 &= (1 - z) \log(1 - x), \\
\int_{z}^{1} \int_{z}^{z_1} \frac{1}{z_1} \, dz_2 \, dz_1 &= (1 - z) + z \log z, \\
\int_{z}^{1} \int_{0}^{x} \frac{1}{z_1} \, dz_2 \, dz_1 &= -x \log z.
\end{align*}

Consequently,

\begin{align*}
K_2^{uf}(x, z) &= x - (1 - x) \log(1 - x) + (1 - z) \log(1 - x) \\
&\quad - x \log z + (1 - z) + z \log z \\
&= 1 - (z - x) + (z - x) \log\left( \frac{z}{1 - x} \right).
\end{align*}

□

Proof of Proposition 6. The result is proved in Technical Proposition 2 in Scornet (2014). □


To prove Theorem 4.2, we need to control the bias of uniform KeRF estimates, 
which is done in Theorem 6.2. 


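Theorem 6.2. Let f be an L-Lipschitz function. Then, for all k,

\[
\sup_{\mathbf{x} \in [0,1]^d}
\left| \frac{\int_{[0,1]^d} K_k^{uf}(\mathbf{0}, |\mathbf{z} - \mathbf{x}|) f(\mathbf{z}) \, dz_1 \cdots dz_d}{\int_{[0,1]^d} K_k^{uf}(\mathbf{0}, |\mathbf{z} - \mathbf{x}|) \, dz_1 \cdots dz_d} - f(\mathbf{x}) \right|
\le \frac{L d \, 2^{2d+1}}{3} \left( 1 - \frac{1}{3d} \right)^{k}.
\]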
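Proof of Theorem 6.2. Let $\mathbf{x} \in [0,1]^d$ and $k \in \mathbb{N}$. Let f be an L-Lipschitz function.
In the rest of the proof, for clarity, we use the notation $d\mathbf{z}$ instead
of $dz_1 \cdots dz_d$. Thus,

\[
\left| \frac{\int_{[0,1]^d} K_k^{uf}(\mathbf{0}, |\mathbf{z} - \mathbf{x}|) f(\mathbf{z}) \, d\mathbf{z}}{\int_{[0,1]^d} K_k^{uf}(\mathbf{0}, |\mathbf{z} - \mathbf{x}|) \, d\mathbf{z}} - f(\mathbf{x}) \right|
\le \frac{\int_{[0,1]^d} K_k^{uf}(\mathbf{0}, |\mathbf{z} - \mathbf{x}|) |f(\mathbf{z}) - f(\mathbf{x})| \, d\mathbf{z}}{\int_{[0,1]^d} K_k^{uf}(\mathbf{0}, |\mathbf{z} - \mathbf{x}|) \, d\mathbf{z}}.
\]

Note that,

\begin{align}
\int_{[0,1]^d} K_k^{uf}(\mathbf{0}, |\mathbf{z} - \mathbf{x}|) |f(\mathbf{z}) - f(\mathbf{x})| \, d\mathbf{z}
&\le L \sum_{\ell=1}^{d} \int_{[0,1]^d} K_k^{uf}(\mathbf{0}, |\mathbf{z} - \mathbf{x}|) |z_\ell - x_\ell| \, d\mathbf{z} \nonumber \\
&\le L \sum_{\ell=1}^{d} \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}}
   \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^{k}
   \prod_{m \ne \ell} \int_{0}^{1} K_{k_m}^{uf}(0, |z_m - x_m|) \, dz_m \nonumber \\
&\qquad \times \int_{0}^{1} K_{k_\ell}^{uf}(0, |z_\ell - x_\ell|) |z_\ell - x_\ell| \, dz_\ell \nonumber \\
&\le L \sum_{\ell=1}^{d} \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}}
   \frac{k!}{k_1! \cdots k_d!} \left( \frac{2}{3} \right)^{k_\ell + 1} \left( \frac{1}{d} \right)^{k}
   \prod_{m=1}^{d} \int_{0}^{1} K_{k_m}^{uf}(0, |z_m - x_m|) \, dz_m \nonumber \\
&\qquad \text{(according to the second statement of Lemma 2, see below)} \nonumber \\
&\le \frac{L}{2^{k-d}} \sum_{\ell=1}^{d} \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}}
   \frac{k!}{k_1! \cdots k_d!} \left( \frac{2}{3} \right)^{k_\ell + 1} \left( \frac{1}{d} \right)^{k}, \tag{17}
\end{align}

according to the first statement of Lemma 2. Still by Lemma 2 and using
inequality (17), we have,

\[
\frac{\int_{[0,1]^d} K_k^{uf}(\mathbf{0}, |\mathbf{z} - \mathbf{x}|) |f(\mathbf{z}) - f(\mathbf{x})| \, d\mathbf{z}}{\int_{[0,1]^d} K_k^{uf}(\mathbf{0}, |\mathbf{z} - \mathbf{x}|) \, d\mathbf{z}}
\le \frac{L \, 2^{2d+1}}{3} \sum_{\ell=1}^{d} \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}}
\frac{k!}{k_1! \cdots k_d!} \left( \frac{2}{3} \right)^{k_\ell} \left( \frac{1}{d} \right)^{k}.
\]

Taking the first term of the sum, we obtain

\begin{align*}
\sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}}
\frac{k!}{k_1! \cdots k_d!} \left( \frac{2}{3} \right)^{k_1} \left( \frac{1}{d} \right)^{k}
&= \sum_{k_1 = 0}^{k} \left( \frac{2}{3d} \right)^{k_1} \left( 1 - \frac{1}{d} \right)^{k - k_1} \frac{k!}{k_1! (k - k_1)!} \\
&\le \left( 1 - \frac{1}{3d} \right)^{k}.
\end{align*}

Finally,

\[
\frac{\int_{[0,1]^d} K_k^{uf}(\mathbf{0}, |\mathbf{z} - \mathbf{x}|) |f(\mathbf{z}) - f(\mathbf{x})| \, d\mathbf{z}}{\int_{[0,1]^d} K_k^{uf}(\mathbf{0}, |\mathbf{z} - \mathbf{x}|) \, d\mathbf{z}}
\le \frac{L d \, 2^{2d+1}}{3} \left( 1 - \frac{1}{3d} \right)^{k}.
\]

□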


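Proof of Theorem 4.2. Let $\mathbf{x} \in [0,1]^d$, $\|m\|_\infty = \sup_{\mathbf{x} \in [0,1]^d} |m(\mathbf{x})|$ and recall that

\[
\tilde{m}_{\infty,n}^{uf}(\mathbf{x}) = \frac{\sum_{i=1}^{n} Y_i K_k^{uf}(\mathbf{0}, |\mathbf{X}_i - \mathbf{x}|)}{\sum_{i=1}^{n} K_k^{uf}(\mathbf{0}, |\mathbf{X}_i - \mathbf{x}|)}.
\]

Thus, letting

\begin{align*}
A_n(\mathbf{x}) &= \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i K_k^{uf}(\mathbf{0}, |\mathbf{X}_i - \mathbf{x}|)}{\mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]} - \frac{\mathbb{E}[Y K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]}{\mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]} \right), \\
B_n(\mathbf{x}) &= \frac{1}{n} \sum_{i=1}^{n} \left( \frac{K_k^{uf}(\mathbf{0}, |\mathbf{X}_i - \mathbf{x}|)}{\mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]} - 1 \right), \\
\text{and} \quad M_n(\mathbf{x}) &= \frac{\mathbb{E}[Y K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]}{\mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]},
\end{align*}

the estimate $\tilde{m}_{\infty,n}^{uf}(\mathbf{x})$ can be rewritten as

\[
\tilde{m}_{\infty,n}^{uf}(\mathbf{x}) = \frac{M_n(\mathbf{x}) + A_n(\mathbf{x})}{1 + B_n(\mathbf{x})},
\]

which leads to

\[
\tilde{m}_{\infty,n}^{uf}(\mathbf{x}) - m(\mathbf{x}) = \frac{M_n(\mathbf{x}) - m(\mathbf{x}) + A_n(\mathbf{x}) - B_n(\mathbf{x}) m(\mathbf{x})}{1 + B_n(\mathbf{x})}.
\]

Note that, according to Theorem 6.2, we have

\begin{align*}
|M_n(\mathbf{x}) - m(\mathbf{x})|
&= \left| \frac{\mathbb{E}[m(\mathbf{X}) K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]}{\mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]} + \frac{\mathbb{E}[\varepsilon K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]}{\mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]} - m(\mathbf{x}) \right| \\
&\le \left| \frac{\mathbb{E}[m(\mathbf{X}) K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]}{\mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]} - m(\mathbf{x}) \right| \\
&\le C_1 \left( 1 - \frac{1}{3d} \right)^{k},
\end{align*}

where $C_1 = L d \, 2^{2d+1} / 3$. Take $\alpha \in \, ]0, 1/2]$. Let $\mathcal{C}_\alpha(\mathbf{x})$ be the event on which
$\{|A_n(\mathbf{x})|, |B_n(\mathbf{x})| \le \alpha\}$. On the event $\mathcal{C}_\alpha(\mathbf{x})$, we have

\begin{align*}
|\tilde{m}_{\infty,n}^{uf}(\mathbf{x}) - m(\mathbf{x})|^2
&\le 8 |M_n(\mathbf{x}) - m(\mathbf{x})|^2 + 8 |A_n(\mathbf{x}) - B_n(\mathbf{x}) m(\mathbf{x})|^2 \\
&\le 8 C_1^2 \left( 1 - \frac{1}{3d} \right)^{2k} + 8 \alpha^2 (1 + \|m\|_\infty)^2.
\end{align*}

Thus,

\[
\mathbb{E}\big[ |\tilde{m}_{\infty,n}^{uf}(\mathbf{x}) - m(\mathbf{x})|^2 \mathbb{1}_{\mathcal{C}_\alpha(\mathbf{x})} \big]
\le 8 C_1^2 \left( 1 - \frac{1}{3d} \right)^{2k} + 8 \alpha^2 (1 + \|m\|_\infty)^2. \tag{18}
\]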

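Consequently, to find an upper bound on the rate of consistency of $\tilde{m}_{\infty,n}^{uf}$, we
just need to upper bound

\begin{align*}
\mathbb{E}\big[ |\tilde{m}_{\infty,n}^{uf}(\mathbf{x}) - m(\mathbf{x})|^2 \mathbb{1}_{\mathcal{C}_\alpha^c(\mathbf{x})} \big]
&\le \mathbb{E}\Big[ \big| \max_{1 \le i \le n} Y_i + m(\mathbf{x}) \big|^2 \mathbb{1}_{\mathcal{C}_\alpha^c(\mathbf{x})} \Big]
   \quad \text{(since } \tilde{m}_{\infty,n}^{uf} \text{ is a local averaging estimate)} \\
&\le \mathbb{E}\Big[ \big| 2\|m\|_\infty + \max_{1 \le i \le n} \varepsilon_i \big|^2 \mathbb{1}_{\mathcal{C}_\alpha^c(\mathbf{x})} \Big] \\
&\le \left( \mathbb{E}\Big[ \big( 2\|m\|_\infty + \max_{1 \le i \le n} \varepsilon_i \big)^4 \Big] \, \mathbb{P}[\mathcal{C}_\alpha^c(\mathbf{x})] \right)^{1/2}
   \quad \text{(by Cauchy-Schwarz inequality)} \\
&\le \left( \Big( 16 \|m\|_\infty^4 + 8 \, \mathbb{E}\Big[ \max_{1 \le i \le n} \varepsilon_i^4 \Big] \Big) \mathbb{P}[\mathcal{C}_\alpha^c(\mathbf{x})] \right)^{1/2}.
\end{align*}

Simple calculations on Gaussian tails show that one can find a constant $C > 0$
such that for all n,

\[
\mathbb{E}\Big[ \max_{1 \le i \le n} \varepsilon_i^4 \Big] \le C (\log n)^2.
\]

Thus, there exists $C_2$ such that, for all $n > 1$,

\[
\mathbb{E}\big[ |\tilde{m}_{\infty,n}^{uf}(\mathbf{x}) - m(\mathbf{x})|^2 \mathbb{1}_{\mathcal{C}_\alpha^c(\mathbf{x})} \big]
\le C_2 (\log n) \big( \mathbb{P}[\mathcal{C}_\alpha^c(\mathbf{x})] \big)^{1/2}. \tag{19}
\]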


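The last probability $\mathbb{P}[\mathcal{C}_\alpha^c(\mathbf{x})]$ can be upper bounded by using Chebyshev's
inequality. Indeed, with respect to $A_n(\mathbf{x})$,

\begin{align*}
\mathbb{P}[|A_n(\mathbf{x})| > \alpha]
&\le \frac{1}{n \alpha^2} \, \mathbb{E}\left[ \left( \frac{Y K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)}{\mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]} - \frac{\mathbb{E}[Y K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]}{\mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]} \right)^{2} \right] \\
&\le \frac{1}{n \alpha^2 (\mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)])^2} \, \mathbb{E}\big[ Y^2 K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)^2 \big] \\
&\le \frac{2}{n \alpha^2 (\mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)])^2}
   \Big( \mathbb{E}\big[ m(\mathbf{X})^2 K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)^2 \big] + \mathbb{E}\big[ \varepsilon^2 K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)^2 \big] \Big),
\end{align*}

which leads to

\begin{align*}
\mathbb{P}[|A_n(\mathbf{x})| > \alpha]
&\le \frac{2 (\|m\|_\infty^2 + \sigma^2) \, \mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]}{n \alpha^2 (\mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)])^2}
   \quad \Big( \text{since } \sup_{\mathbf{x}, \mathbf{z} \in [0,1]^d} K_k^{uf}(\mathbf{0}, |\mathbf{z} - \mathbf{x}|) \le 1 \Big) \\
&\le \frac{M_1^2 \, 2^k}{\alpha^2 n}
   \quad \text{(according to the first statement of Lemma 2)},
\end{align*}

where $M_1^2 = 2^{d+1} (\|m\|_\infty^2 + \sigma^2)$. Meanwhile with respect to $B_n(\mathbf{x})$, we have, still
by Chebyshev's inequality,

\begin{align*}
\mathbb{P}[|B_n(\mathbf{x})| > \alpha]
&\le \frac{1}{n \alpha^2} \, \mathbb{E}\left[ \left( \frac{K_k^{uf}(\mathbf{0}, |\mathbf{X}_i - \mathbf{x}|)}{\mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]} \right)^{2} \right] \\
&\le \frac{1}{n \alpha^2 \, \mathbb{E}[K_k^{uf}(\mathbf{0}, |\mathbf{X} - \mathbf{x}|)]} \\
&\le \frac{2^{k+d}}{n \alpha^2}.
\end{align*}

Thus, the probability of $\mathcal{C}_\alpha(\mathbf{x})$ is bounded from below by

\begin{align*}
\mathbb{P}[\mathcal{C}_\alpha(\mathbf{x})]
&\ge 1 - \mathbb{P}(|A_n(\mathbf{x})| \ge \alpha) - \mathbb{P}(|B_n(\mathbf{x})| \ge \alpha) \\
&\ge 1 - \frac{2^k M_1^2}{n \alpha^2} - \frac{2^{k+d}}{n \alpha^2} \\
&\ge 1 - \frac{2^k (M_1^2 + 2^d)}{n \alpha^2}.
\end{align*}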


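Consequently, according to inequality (19), we obtain

\[
\mathbb{E}\big[ |\tilde{m}_{\infty,n}^{uf}(\mathbf{x}) - m(\mathbf{x})|^2 \mathbb{1}_{\mathcal{C}_\alpha^c(\mathbf{x})} \big]
\le C_2 (\log n) \left( \frac{2^k (M_1^2 + 2^d)}{n \alpha^2} \right)^{1/2}.
\]

Then using inequality (18),

\begin{align*}
\mathbb{E}\big[ \tilde{m}_{\infty,n}^{uf}(\mathbf{x}) - m(\mathbf{x}) \big]^2
&\le \mathbb{E}\big[ |\tilde{m}_{\infty,n}^{uf}(\mathbf{x}) - m(\mathbf{x})|^2 \mathbb{1}_{\mathcal{C}_\alpha(\mathbf{x})} \big]
   + \mathbb{E}\big[ |\tilde{m}_{\infty,n}^{uf}(\mathbf{x}) - m(\mathbf{x})|^2 \mathbb{1}_{\mathcal{C}_\alpha^c(\mathbf{x})} \big] \\
&\le 8 C_1^2 \left( 1 - \frac{1}{3d} \right)^{2k} + 8 \alpha^2 (1 + \|m\|_\infty)^2
   + C_2 (\log n) \left( \frac{2^k (M_1^2 + 2^d)}{n \alpha^2} \right)^{1/2}.
\end{align*}

Optimizing the right hand side in $\alpha$, we get

\[
\mathbb{E}\big[ \tilde{m}_{\infty,n}^{uf}(\mathbf{x}) - m(\mathbf{x}) \big]^2
\le 8 C_1^2 \left( 1 - \frac{1}{3d} \right)^{2k} + C_3 \left( \frac{(\log n)^2 \, 2^k}{n} \right)^{1/3},
\]

for some constant $C_3 > 0$. The last expression is minimized for

\[
k = C_4 + \frac{1}{\log 2 + \frac{2}{d}} \log\left( \frac{n}{(\log n)^2} \right),
\]

where $C_4$ is a constant that does not depend on n. Thus, there exists a constant
$C_5 > 0$ such that, for all $n > 1$,

\[
\mathbb{E}\big[ \tilde{m}_{\infty,n}^{uf}(\mathbf{x}) - m(\mathbf{x}) \big]^2 \le C_5 \, n^{-2/(6 + 3 d \log 2)} (\log n)^2.
\]

□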
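To compare the two upper bounds, the sketch below evaluates the exponents of n appearing in the proofs of Theorems 4.1 and 4.2 for a few dimensions. The function names are illustrative assumptions; the exponent 2/(d + 2) is the classical minimax rate exponent for Lipschitz regression functions, included here only as a point of reference.

```python
import math


def centred_exponent(d: int) -> float:
    """Exponent a in the bound n^{-a} (log n)^2 for centred KeRF."""
    return 1 / (d * math.log(2) + 3)


def uniform_exponent(d: int) -> float:
    """Exponent a in the bound n^{-a} (log n)^2 for uniform KeRF."""
    return 2 / (6 + 3 * d * math.log(2))


def minimax_exponent(d: int) -> float:
    """Classical minimax exponent for Lipschitz regression, for reference."""
    return 2 / (d + 2)


for d in (1, 2, 5, 10, 50):
    print(d, centred_exponent(d), uniform_exponent(d), minimax_exponent(d))
```

In every dimension the centred bound has the larger exponent of the two, and both remain below the minimax exponent, so these upper bounds are not claimed to be optimal.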


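Lemma 2. For all $k_\ell \in \mathbb{N}$ and $x_\ell \in [0,1]$,

(i)
\[
\left( \frac{1}{2} \right)^{k_\ell + 1} \le \int_{[0,1]} K_{k_\ell}^{uf}(0, |z_\ell - x_\ell|) \, dz_\ell \le \left( \frac{1}{2} \right)^{k_\ell - 1};
\]

(ii)
\[
\int_{[0,1]} K_{k_\ell}^{uf}(0, |z_\ell - x_\ell|) \, |x_\ell - z_\ell| \, dz_\ell
\le \left( \frac{2}{3} \right)^{k_\ell + 1} \int_{[0,1]} K_{k_\ell}^{uf}(0, |z_\ell - x_\ell|) \, dz_\ell.
\]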

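Proof of Lemma 2. Let $k_\ell \in \mathbb{N}$ and $x_\ell \in [0,1]$. We start by proving (i). According
to Proposition 6, the connection function of uniform random forests of level
$k_\ell$ takes the form

\begin{align*}
\int_{[0,1]} K_{k_\ell}^{uf}(0, |z_\ell - x_\ell|) \, dz_\ell
&= \int_{-\log x_\ell}^{\infty} e^{-2u} \sum_{j = k_\ell}^{\infty} \frac{u^j}{j!} \, du
 + \int_{-\log(1 - x_\ell)}^{\infty} e^{-2u} \sum_{j = k_\ell}^{\infty} \frac{u^j}{j!} \, du \\
&= \sum_{j = k_\ell}^{\infty} \left( \frac{1}{2} \right)^{j+1} \int_{-2 \log x_\ell}^{\infty} e^{-u} \frac{u^j}{j!} \, du
 + \sum_{j = k_\ell}^{\infty} \left( \frac{1}{2} \right)^{j+1} \int_{-2 \log(1 - x_\ell)}^{\infty} e^{-u} \frac{u^j}{j!} \, du \\
&= \sum_{j = k_\ell}^{\infty} \left( \frac{1}{2} \right)^{j+1} x_\ell^2 \sum_{i=0}^{j} \frac{(-2 \log x_\ell)^i}{i!}
 + \sum_{j = k_\ell}^{\infty} \left( \frac{1}{2} \right)^{j+1} (1 - x_\ell)^2 \sum_{i=0}^{j} \frac{(-2 \log(1 - x_\ell))^i}{i!}.
\end{align*}

Therefore,

\[
\int_{[0,1]} K_{k_\ell}^{uf}(0, |z_\ell - x_\ell|) \, dz_\ell \le \left( \frac{1}{2} \right)^{k_\ell - 1},
\]

and

\[
\int_{[0,1]} K_{k_\ell}^{uf}(0, |z_\ell - x_\ell|) \, dz_\ell
\ge \big( x_\ell^2 + (1 - x_\ell)^2 \big) \left( \frac{1}{2} \right)^{k_\ell}
\ge \left( \frac{1}{2} \right)^{k_\ell + 1}.
\]

Regarding the second statement of Lemma 2, we have

\begin{align*}
\int_{[0,1]} K_{k_\ell}^{uf}(0, |z_\ell - x_\ell|) \, |x_\ell - z_\ell| \, dz_\ell
&= \int_{[0,1]} |x_\ell - z_\ell|^2 \sum_{j = k_\ell}^{\infty} \frac{(-\log |x_\ell - z_\ell|)^j}{j!} \, dz_\ell \\
&= \int_{z_\ell \le x_\ell} (x_\ell - z_\ell)^2 \sum_{j = k_\ell}^{\infty} \frac{(-\log |x_\ell - z_\ell|)^j}{j!} \, dz_\ell
 + \int_{z_\ell > x_\ell} (z_\ell - x_\ell)^2 \sum_{j = k_\ell}^{\infty} \frac{(-\log |x_\ell - z_\ell|)^j}{j!} \, dz_\ell \\
&= \int_{[0, x_\ell]} v^2 \sum_{j = k_\ell}^{\infty} \frac{(-\log v)^j}{j!} \, dv
 + \int_{[0, 1 - x_\ell]} u^2 \sum_{j = k_\ell}^{\infty} \frac{(-\log u)^j}{j!} \, du \\
&= \int_{-\log(x_\ell)}^{\infty} e^{-3w} \sum_{j = k_\ell}^{\infty} \frac{w^j}{j!} \, dw
 + \int_{-\log(1 - x_\ell)}^{\infty} e^{-3w} \sum_{j = k_\ell}^{\infty} \frac{w^j}{j!} \, dw \\
&= \frac{2}{3} \int_{-3 \log(x_\ell)/2}^{\infty} e^{-2v} \sum_{j = k_\ell}^{\infty} \frac{(2v/3)^j}{j!} \, dv
 + \frac{2}{3} \int_{-3 \log(1 - x_\ell)/2}^{\infty} e^{-2v} \sum_{j = k_\ell}^{\infty} \frac{(2v/3)^j}{j!} \, dv \\
&\le \left( \frac{2}{3} \right)^{k_\ell + 1} \left( \int_{-\log(x_\ell)}^{\infty} e^{-2v} \sum_{j = k_\ell}^{\infty} \frac{v^j}{j!} \, dv
 + \int_{-\log(1 - x_\ell)}^{\infty} e^{-2v} \sum_{j = k_\ell}^{\infty} \frac{v^j}{j!} \, dv \right) \\
&\le \left( \frac{2}{3} \right)^{k_\ell + 1} \int_{[0,1]} K_{k_\ell}^{uf}(0, |z_\ell - x_\ell|) \, dz_\ell.
\end{align*}

□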
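Both statements of Lemma 2 can be checked by numerical integration. The sketch below is a sanity check under stated assumptions: it takes the one-dimensional kernel in the form $K_k^{uf}(0, v) = 1 - v \sum_{j=0}^{k-1} (-\log v)^j / j!$, which is the expression consistent with the series $v \sum_{j \ge k} (-\log v)^j / j!$ used in the proof, and it uses a plain midpoint rule. The function names, the grid size and the tested values of x and k are illustrative.

```python
import math


def uniform_kernel_0(v: float, k: int) -> float:
    """Assumed form of K_k^{uf}(0, v): 1 - v * sum_{j<k} (-log v)^j / j!."""
    if v <= 0.0:
        return 1.0                           # limit of the expression as v -> 0
    t = -math.log(v)
    return 1.0 - v * sum(t**j / math.factorial(j) for j in range(k))


def lemma2_integrals(x: float, k: int, n_grid: int = 20000):
    """Midpoint-rule values of int K dz and int K |z - x| dz over [0, 1]."""
    h = 1.0 / n_grid
    plain = 0.0
    weighted = 0.0
    for i in range(n_grid):
        z = (i + 0.5) * h
        v = abs(z - x)
        kv = uniform_kernel_0(v, k)
        plain += kv * h
        weighted += kv * v * h
    return plain, weighted


for x in (0.1, 0.5, 0.9):
    for k in (1, 2, 4):
        plain, weighted = lemma2_integrals(x, k)
        ok_i = 0.5 ** (k + 1) <= plain <= 0.5 ** (k - 1)
        ok_ii = weighted <= (2 / 3) ** (k + 1) * plain
        print(x, k, plain, weighted, ok_i, ok_ii)
```

For k = 1 the kernel is 1 - v, so at x = 0.5 the first integral equals 0.75 and the second equals 1/6, both well inside the bounds of the lemma.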